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Abstract 

In  regression  analysis  the  response  variable  Y  and  the  predictor 
variables  X.|,...,X  are  often  replaced  by  functions  0 ( Y)  and 

*1^) . ♦  (X  ).  We  discuss  a  procedure  for  estimating  those  functions 

9*  and  <!>*»•..,$*  that  minimize 


e2  = 


•  Wh 

J-l  3  J 


)]2} 


E{[e(Y) 

Var[9(Y)] 

given  only  a  sample  {(y^x^  »•  •  •  »X|cp) ,  1  <_k  <_N}  and  making  minimal 
assumptions  concerning  the  data  distribution  or  the  form  of  the  solution 
functions.  For  the  bivariate  case,  p=l,  9*  and  <p*  satisfy 

p*  =  p(9*,<i>*)  3  max  p[9(Y) ,$(X)]  where  p  is  the  product  moment  corre- 

0<J> 

lation  coefficient  and  p*  is  the  maximal  correlation  between  X  and 
Y.  Our  procedure  thus  also  provides  a  method  for  estimating  the  maximal 
correlation  between  two  variables. 


Work  supported  by  Office  of  Naval  Research  under  contracts  N00014-82-K-0054 
and  N00014-81 -K-0340. 
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1 .  Introduction 


— ^  Nonlinear  transformation  of  variables  is  a  commonly  used  practice 
in  regression  problems.  Two  common  goals  are  stabilization  of  error 
variance  and  symmetrization/ normal ization  of  error  distribution.  A  more 
comprehensive  goal,  and  the  one  we  adopt,  is  to  find  those  transformations 
that  produce  the  best  fitting  additive  model.  Knowledge  of  such  trans¬ 
formations  aid  in  the  interpretation  and  understanding  of  the  relation¬ 
ship  between  the  response  and  predictors. 

Let  Y,X-|,...,X  be  random  variables  with  Y  the  response  and 
Xr...,Xp  the  predictors.  Let  e(Y) ,d>1  (X1 ) . <J>p(Xp)  be  arbitrary 

measurable  functions  of  the  corresponding  random  variables.  The  fraction 

2  P 

of  variance  not  explained  (e  )  by  a  regression  of  e ( Y)  on  £  (X^ ) 

is 

n 

i2, 

tlL -  l 

O  4.1  I  * 

(1.1) 


1-1 


E{[e(Y)  -  l  4>.(X.)]‘} 

2/fl  .  .  v  i*l  1  1 

6  (9  •  ,<{>p )  “ 


Var[6(Y)J 


Then  define  optimal  transformations  as  functions  0*,$*, . . .  ,<{>*  that 
minimize  (1.1):  i.e. 


(1.2) 


e2(0*»$*»..«»4>p) 


min 


0  »4>i » •  ■  •  »^p 


®  (0  » •  •  •  >$_,)  * 


We  show  in  Section  5  that  optimal  transformations  exist  and  satisfy 
a  complex  system  of  integral  equations.  The  heart  of  our  approach  is 
that  there  is  a  simple  iterative  algorithm  using  only  bivariate  condi¬ 
tional  expectations  which  converges  to  an  optimal  solution.  When  the 
conditional  expectations  are  estimated  from  a  finite  data  set,  then  use 
of  the  algorithm  results  in  estimates  of  the  optimal  transformations. 


This  method  has  some  powerful  characteristics.  It  can  be  applied  in 
situations  where  the  response  and/or  the  predictors  involve  arbitrary 
mixtures  of  continuous  ordered  variables  and  categorical  variables  (ordered 
or  unordered).  The  functions  0 ....  »d>  are  real  valued.  If  the 
original  variable  is  categorical,  the  application  of  0  or  4^.  assigns 
a  real  valued  score  to  each  of  its  categorical  values. 

The  procedure  is  nonparametric.  The  optimal  transformation  estimates 
are  based  solely  on  the  data  sample  ((y^**^  »•  •  ^  < N}  with 

minimal  assumptions  concerning  the  data  distribution  and  the  form  of  the 
optimal  transformations.  The  principal  distributional  assumption  is  that 
the  data  are  i.i.d.  In  particular,  we  do  not  require  the  transformation 
functions  to  be  from  a  particular  parameterized  family  or  even  monotone. 

(We  illustrate  below  situations  where  the  optimal  transformations  are  not 
monotone. ) 

For  the  bivariate  case,  p  =1 ,  the  optimal  transformations  0*(Y), 
<t>*(X)  satisfy 

(1.3)  p*(X, Y)  =  p(0*,4>*)  -  max  p[e(Y)  ,4>(X)] 

e,4> 

where  p  is  the  product  moment  correlation  coefficient.  The  quantity 
p*(X,Y)  is  known  as  the  maximal  aowelation  between  X  and  Y,  and  is 
used  as  a  general  measure  of  dependence  (Gebelern  [1947];  see  also  Renyi 
[1959]  and  Sarmanov  [1958A.B]).  The  maximal  correlation  has  the  follow¬ 
ing  properties  (Renyi  [1959]): 

(a)  0  <  p*(X,Y)  <  1 

(b)  p*(X,Y)  =  0  if  and  only  if  X  and  Y  are  independent 
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(c)  If  there  exists  a  relation  of  the  form  u(X)  =  v(Y)  where  u 
and  v  are  Borel -measurable  functions  with  var[u(X)]  >  0, 
then  p*(X,Y)  =  1 . 

Therefore,  in  the  bivariate  case  our  procedure  can  also  be  regarded  as  a 
method  for  estimating  the  maximal  correlation  between  two  variables, 
providing  as  a  by-product  estimates  of  the  functions  9*,  4>*  that  achieve 
the  maximum. 

In  the  next  section,  we  describe  our  procedure  for  finding  optimal 
transformations  using  algorithmic  notation,  deferring  mathematical 
justifications  to  sections  5  and  6.  We  next  illustrate  the  procedure  in 
Section  3  by  applying  it  to  a  number  of  simulated  data  sets  where  the 
optimal  transformations  are  known.  The  estimates  are  surprisingly  good. 

Our  algorithm  is  also  applied  to  the  Boston  housing  data  of  Harrison  and 
Rubinfeld  [1978]  as  listed  in  Belsey,  Kuh  and  Welsch  [1980].  The  trans¬ 
formations  found  by  the  algorithm  generally  differ  from  those  applied  in 
the  original  analysis.  (A  FORTRAN  implementation  of  our  algorithm  is 
listed  in  Appendix  2).  Section  4  presents  a  general  discussion  and  relates 
this  procedure  to  other  empirical  methods  for  finding  transformations. 

Sections  5,  6,  and  Appendix  1  provide  some  theoretical  framework  for 
the  algorithm.  In  Section  5,  under  weak  conditions  on  the  joint  distribu¬ 
tion  of  Y,Xj,...,X  ,  it  is  shown  that  optimal  transformations  exist  and 
are  generally  unique  up  to  a  change  of  sign.  The  optimal  transformations 
are  characterized  as  the  eigenfunctions  of  a  set  of  linear  integral  equa¬ 
tions  whose  kernels  involve  bivariate  distributions.  We  then  show  that 
our  procedure  converges  to  optimal  transformations. 

Section  6  discusses  the  algorithm  as  applied  to  finite  data  sets. 

The  results  are  dependent  on  the  type  of  data  smooth  employed  to  estimate 


the  bivariate  conditional  expectations.  Convergence  of  the  algorithm  is 
proven  only  for  a  very  restricted  class  of  data  smooths.  However,  in  over 
a  thousand  applications  of  the  algorithm  on  a  variety  of  data  sets  using 
three  different  types  of  data  smoothers  only  one  (very  contrived)  instance 
of  non  convergence  has  been  found. 

Section  6  also  contains  proof  of  a  consistency  result.  Under  fairly 
general  conditions,  as  the  sample  size  increases  the  finite  data  trans¬ 
formations  converge  in  a  "weak"  sense  to  the  distributional  space  optimal 
transformations.  Finally,  Appendix  1  contains  a  brief  discussion  of  the 
needed  consistency  properties  of  bivariate  smooths. 

This  paper  is  laid  out  in  two  distinct  parts.  Sections  1-4  give  a 
fairly  non-technical  overview  of  the  method  and  discuss  its  application  to 
data.  Sections  5  and  6  are,  of  necessity,  more  technical,  presenting  the 
theoretical  foundation  for  the  procedure. 


2.  The  Algorithm 

Our  procedure  for  finding  9*,4>* . <f>*  is  iterative.  Assume  a  known 

distribution  for  the  variables  Y,X-| . Xp.  Without  loss  of  generality, 

let  var[e(Y)]  =  1,  and  assume  that  all  functions  have  expectation  zero. 

To  illustrate,  we  first  look  at  the  bivariate  case; 

(2.1)  e2(9,4>)  =  E[e(Y)-*(X)]2 

Consider  the  minimization  of  (2.1)  with  respect  to  8 ( Y)  for  a  given 
function  <j>(X).  The  solution  is 

(2.2)  9 1  (Y)  =  E[<J>(X)  |  Y]/IE[<J>(X)  j  Y]H 
2  1/2 

with  1*1  =  [E( - )  ]  '  .  Next,  consider  the  minimization  of  (2.1)  with 
respect  to  <J>(X)  for  a  given  0(Y).  The  solution  is 

(2.3)  4>'(X)  •  E[e(Y)|X]  . 

Equations  (2.2)  and  (2.3)  form  the  basis  of  an  iterative  optimization 
procedure  involving  alternating  conditional  expectations  (ACE): 

BASIC  ACE  ALGORITHM 
set  0 ( Y )  *  Y/8 YB ; 

ITERATE  UNTIL  e2(9,4>)  fails  to  decrease: 

<D’(X)  «  E[0(Y)  |  X]; 
replace  <j>(X)  with  4> ‘ (X ) ; 
e  *  (Y)  *  E[<j>(X)  |  Y]/lE[<j)(X)  |  Y] I ; 
replace  0(Y)  with  9'(Y); 

END  ITERATION  LOOP; 

0  and  4>  are  the  solutions  0*  and  <{>*; 

END  ALGORTHM; 
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Thls  algorithm  decreases  (2.1)  at  each  step  by  alternatingly  mini¬ 
mizing  with  respect  to  one  function  holding  the  other  fixed  at  its 
previous  evaluation.  Each  iteration  (execution  of  the  iteration  loop) 
performs  one  pair  of  these  single  function  minimizations.  The  process 
begins  with  an  initial  guess  for  one  of  the  functions  (6  =  Y/JYI  above) 
and  ends  when  a  complete  iteration  pass  fails  to  decrease  e  (2.1).  In 
Section  5,  we  prove  that  the  algorithm  converges  to  optimal  transformations 
0*.  $*. 

Now  consider  the  more  general  case  of  multiple  predictors  . Xp. 

We  proceed  in  direct  analogy  with  the  basic  ACE  algorithm;  we  minimize 

(2.4)  e2(e,<j>r...,<j,  )  =  E[0(Y)  -  l  <f>.(X.)]2  , 

i  P  j=1  J  J 

2 

holding  E9  =1,  E8  =  E<|>^  =E<j>p  ■  0,  through  a  series  of  single 
function  minimizations  involving  bivariate  conditional  expectations.  For 
a  given  set  of  functions  ^  (X^ ) ,. . .  ,<tp(X  ),  minimization  of  (2.4)  with 
respect  to  e(Y)  yields 

(2.5)  0'(Y)  -  E[  l  <h(X  )1Y]/IE[  l  d>.(X.)!Y]ll 

i=l  11  i=l  1  1 

The  next  step  is  to  minimize  (2.4)  with  respect  to  ^(X^) . <t>p(Xp) 

given  0(Y).  This  is  obtained  through  another  iterative  algorithm. 

Consider  the  minimization  of  (2.4)  with  respect  to  a  single  function 
W  for  9iven  9(Y)  and  a  given  set  ^  , . . .  ,4>k+1 . .  ,<J>p.  The 

solution  is 

(2.6)  4>'(Xk)  =  E[0(Y)  -  7  ♦1(X1)|Xk]  . 

ifk 

The  corresponding  iterative  algorithm  is  then: 
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S6t  <j>-j  (X-j )  >  •  •  •  >4>p( Xp)  ~  0; 

ITERATE  UNTIL  e2(e,0.| , . . . ,<j>p)  fails  to  decrease; 

FOR  k  *  1  TO  p  DO; 

4>^(Xk)  =  E[0(Y)  -  ^(X.JjX^; 

replace  (X.  )  with  4>MX.  ); 

END  FOR  LOOP; 

END  ITERATION  LOOP; 

<t>1 . <j>p  are  the  solution  functions; 

2 

Each  iteration  of  the  inner  FOR  loop  minimizes  e  (2.4)  with  respect  to 
the  function  ^(X^),  k=l,...,p  with  all  other  functions  fixed  at 
their  previous  evaluations  (execution  of  the  FOR  loop).  The  outer  loop 
is  iterated  until  one  complete  pass  over  the  predictor  variables  (inner 
FOR  loop)  fails  to  decrease  e  (2.4). 

Substituting  this  procedure  for  the  corresponding  single  function 
optimization  in  the  bivariate  ACE  algorithm  gives  rise  to  the  full  ACE 
algorithm  for  minimizing  e  (2.4): 

ACE  ALGORITHM: 

set  9(Y)  =  Y/1YII  and  ^  (X] ) . <j>p(Xp)  =  0; 

ITERATE  UNTIL  e^ ( e , d>^ , .  - . , 4>p )  fails  to  decrease; 

ITERATE  UNTIL  e^(9 , . . .  ,d»p)  fails  to  decrease; 

FOR  k  =1  TO  p  DO: 

<i>k(Xk)  =  E[0(Y)  -  .^1(Xi)|Xk]; 

replace  0.  (X.  )  with  d>|L(X.  ); 

END  FOR  LOOP; 

END  INNER  ITERATION  LOOP; 

9'(Y)  =  E[  l  0.  (X. )  |  Y  ]  /  II E  [  l  0i(X.)|Y]ll 
i=l  1  1  i=l  1  1 

replace  e(Y)  with  0'(Y); 

END  OUTER  ITERATION  LOOP; 

0,<j>1 , . . .  ,4>p  are  the  solutions  0* ,d>* . 0*; 

END  ACE  ALGORTHM; 
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In  Section  5,  we  prove  that  the  ACE  algorithm  converges  to  optimal  trans¬ 
formations. 
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3.  Applications 

In  the  previous  section,  the  ACE  algorithm  was  developed  in  the  con¬ 
text  of  known  distributions.  In  practice,  data  distributions  are  seldom 
known.  Instead,  one  has  a  data  set  .  **x|<p)*  1  <.k  £N>  that  is 

presumed  to  be  a  sample  from  Y,Xj,...,X  .  The  goal  is  to  estimate  the 
optimal  transformation  functions  9 ( Y )  ,<j»1  (X^ ) . <j>p(Xp)  from  the  data. 

This  can  be  accomplished  by  applying  the  ACE  algorithm  to  the  data  wi' 

2 

the  quantity  e  ,  II  II ,  and  the  conditional  expectations  replaced  by 

suitable  estimates.  The  resulting  functions  §*,$*,...,$*  are  then 

taken  as  estimates  of  the  corresponding  optimal  transformations. 

2 

The  estimate  for  e  is  the  usual  mean  squared  error  for  regression, 

2  I  N  P  o 

e  (e.^ . ♦p)  *  JiCe(yk) 

2 

If  g(y.x] ,. . .  ,xp)  is  a  function  defined  for  all  data  values,  then  llgll 
is  replaced  by 

,g2,N  *  I  J192(yk’xkr,,,,xkp)  • 

For  the  case  of  categorical  variables,  the  conditional  expectation  estimates 
are  straightforward: 

E[A|Z-z]  =  l  A  /  l  1 

zj'2  zrz 

where  A  is  a  real  valued  quantity  and  the  sums  are  over  the  subset  of 
observations  having  (categorical)  value  Z=z.  For  variables  that  can 
assume  many  ordered  values,  the  estimation  is  based  on  smoothing  techniques. 
Such  procedures  have  been  the  subject  of  considerable  study  (see,  for 
example,  Gasser  and  Rosenblatt  [1979],  Cleveland  [1979],  Craven  and 
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Wahba  [1979]).  Since  the  smoother  is  repeatedly  applied  in  the  algorithm, 
high  speed  is  desirable,  as  well  as  adaptability  to  local  curvature.  We 
use  a  smoother  employing  local  linear  fits  with  varying  window  width 
determined  by  local  cross-validation  ("super  smoother",  Friedman  and 
Stuetzle  [1982]). 


The  algorithm  evaluates  0  at  all  the  corresponding  data 

values,  i.e.  0*(y)  is  evaluated  at  the  set  of  data  values  {yk>, 
k=l,...,N.  The  simplest  way  to  understand  the  shape  of  the  transforma¬ 
tions  is  by  means  of  a  plot  of  the  function  versus  the  corresponding  data 
values,  that  is,  through  the  plots  of  9*(yk)  versus  yk  and 
versus  the  data  values  of  x-|,...,x  respectively. 

In  this  section,  we  illustrate  the  ACE  procedure  by  applying  it  to 
various  data  sets.  (A  FORTRAN  subroutine  implementing  our  procedure  is 
listed  in  Appendix  2.)  In  order  to  evaluate  performance  on  finite  samples, 
the  procedure  is  first  applied  to  simulated  data  for  which  the  optimal 
transformations  are  known.  We  then  apply  it  to  the  boston  housing  data  of 
Harrison  and  Rubinfeld  [1978]  as  listed  in  Belsey,  Kuh  and  Welsch  [1980], 
contrasting  the  ACE  transformations  with  those  used  in  the  original 
analysis. 

Our  first  example  consists  of  200  bivariate  observations  {(yk,xk), 

1  <_k£200}  generated  from  the  model 


yk  =  exp[sin(xk)  +ek/2] 

with  the  xk  sampled  from  a  uniform  distribution  11(0, 2tt)  and  the  ek 
drawn  independently  of  the  xk  from  a  standard  normal  distribution 
N(0,1).  Figure  la  shows  a  scatterplot  of  these  data.  Figures  lb-id  show 
the  results  of  applying  the  ACE  algorithm  to  the  data.  The  estimated 
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optimal  transformation  §*(y)  is  shown  in  the  plot  lb  of  §*(yk)  versus 
yk,  1  <  k  <  200.  Figure  1c  is  a  plot  of  <£*(xk)  versus  xk.  These 
plots  clearly  suggest  the  transformations  0(y)  =  log(y)  and  4>(x)  = 
sin(x)  which  are  optimal  for  the  parent  distribution.  Figure  Id  is  a 
plot  of  §*(yk)  versus  $*(xk).  This  plot  indicates  a  more  linear  rela¬ 
tion  between  the  transformed  variables  than  that  between  the  untransformed 


ones. 

The  next  issue  we  address  is  how  much  the  algorithm  overfits  the 

data  due  to  the  repeated  smoothings,  resulting  in  inflated  estimates  of 

2  2 

the  maximal  correlation  p*  and  of  R*  =  1  -  e*  .  The  answer,  on  the  simulated 
data  sets  we  have  generated,  is  surprizing  little. 

p 

To  illustrate  this,  we  contrast  two  estimates  of  p*  and  R* 
using  the  above  model.  The  known  optimal  transformations  are  e(Y)  « 
logY,  $(X)  *  sin  X.  Therefore,  we  define  the  direct  estimate  for  p* 
given  any  data  set  generated  as  above  by 


1  N  _  _ 

P  =  n  I  (log  yk -log  y)(sin  xk -sin  x) 
k-1 

a  2  /%  2 

and  R*  *  p*  .  The  ACE  algorithm  produces  the  estimates 


and  R*2  =  1  -e*2  =  p*2.  In  this  model  p*  =  .8165  and  R*2  =  .6667. 

For  100  data  sets,  each  of  size  200,  generated  from  the  above  model, 
the  means  and  standard  deviations  of  the  p*  estimates  are 


p*  direct 
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Figure  la 
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2 

The  means  and  standard  deviations  of  the  R*  estimates  are 

mean  s.d. 

R*2  direct  .664  .031 

ACE  .654  .050 

We  also  computed  the  differences  p*  -p*  and  8*2  -i*2  for  the  100 

data  sets.  The  means  and  standard  deviations  are 

mean  s.d. 

p*-p*  -.006  .015 

8*2-R*2  -.010  .024 

The  above  experiment  was  duplicated  for  smaller  sample  size  N=100. 
In  this  case  we  obtain 

mean  s.d. 

p*-p*  -.005  .027 

-  2  *  2 

R*  -R*  -.007  .044 

Our  next  example  consists  of  a  sample  of  200  triples  Uyk>*ki  »xk2^ » 
1  £k  £200}  drawn  from  the  model  Y  =  X-^  with  X-j  and  generated 
independently  from  a  uniform  distribution  U(-l,l).  Note  that  9(Y)  = 
log(Y)  and  4>.(X.)  =  log  X.  (j  =1,2)  cannot  be  solutions  here  since  Y, 
X-|  and  Xg  all  assume  negative  values.  Figure  2a  shows  a  plot  of 
9*(yk)  versus  y^,  while  Figures  2b  and  2c  show  corresponding  plots  of 
<£f(xk-|)  and  ^xk2^  Olk£200).  All  three  solution  transformation 

functions  are  seen  to  be  double  valued.  The  optimal  transformations  for 
this  problem  are  9*(Y)  =  log|Y|  and  ij>*(X.)  =  log  |X.|  (j=l,2).  The 

J  J  J 

estimates  clearly  reflect  this  structure  except  near  the  origin  where  the 
smoother  cannot  reproduce  the  infinite  discontinuity  in  the  derivative. 


For  our  final  example,  we  apply  the  ACE  algorithm  to  the  Boston 
housing  market  data  of  Harrison  and  Rubinfeld  [1978].  A  complete  listing 
of  these  data  appear  in  Belsey,  Kuh  and  Welsch  [1980].  Harrison  and 
Rubinfeld  used  these  data  to  estimate  marginal  air  pollution  damages  as 
revealed  in  the  housing  market.  Central  to  their  analysis  was  a  housing 
value  equation  which  relates  the  median  value  of  owner-occupied  homes  in 
each  of  the  506  census  tracts  in  the  Boston  Standard  Metropolitan 
Statistical  Area,  to  air  pollution  (as  reflected  in  concentration  of 
nitrogen  oxides)  and  to  12  other  variables  that  are  thought  to  effect 
housing  prices.  This  equation  was  estimated  by  trying  to  determine  the 
best  fitting  functional  form  of  housing  price  on  these  13  variables.  By 
experimenting  with  a  number  of  possible  transformations  of  the  14  varia¬ 
bles  (response  and  13  predictors),  Harrison  and  Rubinfeld  settled  on  an 
equation  of  the  form 

log(MV)  =  +  a2(RM)2  +  a3  AGE 

+  ot^  log(DIS)  +  <*5  log  (RAD)  +  a g  TAX 
+  a?  PTRATIO  +  ag(B-0.63)2 
+  oig  log(LSTAT)  +  a-|g  CRIM  +  ZN 
+  a]2  INDUS  +  a]3  CHAS  +  a14(N0X)p  +  e  . 

A  brief  description  of  each  variable  is  given  in  Table  1.  (For  a  more 
complete  description,  see  Harrison  and  Rubinfeld  [1978],  Table  IV.)  The 
coefficients  a-j,...,Oj4  were  determined  by  a  least  squares  fit  to  mea¬ 
surements  of  the  14  variables  for  the  506  census  tracts.  The  best  value 
for  the  exponent  p  was  found  to  be  2.0,  by  a  numerical  optimization 
(grid  search).  This  "basic  equation"  was  used  to  generate  estimates  for 
the  willingness  to  pay  for  and  the  marginal  benefits  of  clean  air. 
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Variable 

MV 

RM 

AGE 

DIS 

RAD 

TAX 

PTRATIO 

B 

LSAT 

CRIM 

ZN 

INDUS 

CHAS 

NOX 


TABLE  1 

Variables  Used  in  the  Housing  Value  Equation 
of  Harrison  and  Rubinfeld  (1978) 

Definition 

Median  value  of  owner-occupied  homes 

Average  number  of  rooms  in  owner  units 

Proportion  of  owner  units  built  prior  to  1940 

Weighted  distances  to  five  employment  centers  in  the  Boston 
region 

Index  of  accessibility  to  radial  highways 
Full  property  tax  rate  ($/$l 0,000) 

Pupil -teacher  ratio  by  town  school  district 
Black  proportion  of  population 
Proportion  of  population  that  is  lower  status 
Crime  rate  by  town 

Proportion  of  town's  residential  land  zoned  for  lots  greater 
than  25,000  square  feet 

Proportion  of  nonretail  business  acres  per  town 

Charles  River  dummy:  =  1  if  tract  bounds  the  Charles  River; 

=  0  if  otherwise 


Nitrogen  oxide  concentration  in  pphm 


Harrison  and  Rubinfeld  note  that  the  results  are  highly  sensitive  to  the 
particular  specification  of  the  form  of  the  housing  price  equation. 

We  applied  the  ACE  algorithm  to  the  transformed  measurements  (using 
p=2  for  NOX)  appearing  in  the  basic  equation.  To  the  extent  that  these 
transformations  are  close  to  the  optimal  ones,  the  algorithm  will  produce 
results  close  to  linear  functions  0(Z)  =^(Z)  =  •••  =  aZ +8. 

Departures  from  linearity  indicate  transformations  that  can  improve  the 
quality  of  the  fit. 

Figure  3a  shows  a  plot  of  the  solution  response  transformation 
§*(1og  y).  This  function  is  seen  to  have  a  positive  curvature  for 
central  values  of  y,  connecting  two  straight  line  segments  of  different 
slope  on  either  side.  This  suggests  that  the  logarithmic  transformation 
may  be  too  severe.  Figure  3b  shows  the  transformation  §*(y)  resulting 
when  the  ACE  algorithm  is  applied  to  the  original  untransformed  census 
measurements.  This  indicates  that,  if  anything,  a  very  mild  transforma¬ 
tion,  involving  positive  curvature,  is  most  appropriate  for  the  response 
variable. 

Figures  3c-3o  show  the  ACE  transformations  <£*,...,$*3  for  the 
(transformed)  predictor  variables.  The  standard  deviation  a($p  is 
indicated  in  each  graph.  This  provides  a  measure  of  how  strongly  each 
<i>t(x.)  enters  into  the  model  for  8*(y).  (Note  that  <?(§*)  =  1.)  The 

J  J 

two  terms  that  enter  most  strongly  involve  the  number  of  rooms  (Figure  3c) 
and  the  fraction  of  population  that  is  of  lower  status  (Figure  3j).  The 
nearly  linear  shape  of  the  latter  transformation  suggests  that  the  original 
logarithmic  transformation  was  appropriate  for  this  variable.  The  trans¬ 
formation  on  the  number  of  rooms  variable  is  far  from  linear,  however, 
indicating  that  a  quadratic  does  not  adequately  capture  its  relationship 
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to  housing  value.  For  less  than  six  rooms,  housing  value  is  roughly 

independent  of  room  number,  while  for  larger  values  there  is  a  strong 

increasing  linear  dependence.  Among  the  next  three  variables  (in  order 

of  their  contribution  to  the  model),  log  DIS  (Figure  3e),  CRIM 

(Figure  3k),  and  INDUS  (Figure  3m),  only  CRIM  has  a  solution  close  to  a 

straight  line.  The  plots  for  the  remaining  variables  indicate  that 

several  of  them  could,  as  well,  benefit  from  transformations  substantially 

different  from  those  used  to  define  the  basic  equation. 

2 

The  marginal  effect  of  (NOX)  on  median  home  value,  as  captured  by 

a  2 

this  model,  can  be  investigated  by  studying  $*[(N0X)  ]  in  Figure  3o. 

2 

This  curve  is  a  nonmonotonic  function  of  NOX  not  well  approximated  by 

a  linear  (or  monotone)  function.  This  makes  it  difficult  to  formulate  a 

simple  interpretation  of  the  willingness  to  pay  for  clean  air  from  these 

data.  For  low  concentration  values,  housing  prices  seem  to  increase  with 
2 

increasing  (NOX)  ,  whereas  for  higher  values  this  trend  is  substantially 
reversed. 

13  ^ 

Figure  3p  shows  a  scatterplot  of  §*(yt)  versus  £  $*(x.  .).  This 

<  j=l  J  KJ 

plot  shows  no  evidence  of  additional  structure  not  captured  in  the  model 

9*(y)  ■  l  $T(xJ  +  e  . 
j-1  J  0 

*  2 

The  e*  resulting  from  the  use  of  the  ACE  transformations  was  0.11  as 
2 

compared  to  the  e  value  of  0.20  produced  by  the  Harrison  and  Rubinfeld 


[1978]  transformations. 


4.  Discussion 

The  ACE  algorithm  provides  a  fully  automated  method  for  estimating 
optimal  transformations  in  multiple  regression.  It  also  provides  a 
method  for  estimating  maximal  correlation  between  random  variables.  It 
differs  from  other  empirical  methods  for  finding  transformations  (Box  and 
Tidwell  [1962];  Anscombe  and  Tukey  [1963];  Box  and  Cox  [1964];  Kruskal 
[1965];  Draper  and  Cox  [1969];  Fraser  [1967];  Linsey  [1972];  Box  and  Hill 
[1974];  Linsey  [1974];  Wood  [1974];  Mosteller  and  Tukey  [1977];  and  Tukey 
[1982])  in  that  the  "best"  transformations  of  the  response  and  predictor 
variables  are  unambiguously  defined  and  estimated  without  use  of  ad  hoc 
heuristics,  restrictive  distributional  assumptions,  or  restriction  of  the 
transformation  to  a  particular  parametric  family. 

The  algorithm  is  reasonably  computer  efficient.  On  the  Boston  housing 
data  set  comprising  506  data  points  with  14  variables  each,  the  run  took 
12  seconds  of  CPU  time  on  an  IBM  3081.  Our  guess  is  that  this  translates 
into  2.5  minutes  on  a  VAX  11/750  with  FP.  To  extrapolate  to  other  problems, 
use  the  estimate  that  running  time  is  proportional  to  (number  of  variables) 
x  (sample  size). 

A  strong  advantage  of  the  ACE  procedure  is  the  ability  to  incorporate 
variables  of  quite  different  type  in  terms  of  the  set  of  values  they  can 
assume.  The  transformation  functions  ©(y)  ,d>-|  (Xi ) , . . .  ,d>p(xp)  assume 
values  on  the  real  line.  Their  arguments  can,  however,  assume  values  on 
any  set.  For  example,  ordered  real,  periodic  (circularly  valued)  real, 
ordered  and  unordered  categorical  variables  can  be  incorporated  in  the 
same  regression  equation.  For  periodic  variables,  the  smoother  window 
need  only  wrap  around  the  boundaries.  For  categorical  variables,  the 
procedure  can  be  regarded  as  estimating  optimal  scores  for  each  of  their 
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values.  (The  special  case  of  a  categorical  response  and  a  single  cate¬ 
gorical  predictor  variable  is  known  as  canonical  analysis--see  Kendall 
and  Stuart  [1967],  p.568— and  the  optimal  scores  can,  in  this  case,  also 
be  obtained  by  solution  of  a  matrix  eigenvector  problem.) 

In  some  problems  the  analyst  may  wish  to  restrict  §(y)  to  be 
monotone.  For  example,  §(y)  monotone  allows  the  unique  specification  of 
y  given  a  value  of  §(y).  There  is  an  option  in  the  program  (Appendix  2) 
that  allows  this  using  KruskaVs  method  of  finding  closest  monotone  fits 
(see  Kruskal  [1964],  pp.  126-128).  However,  we  advise  running  the  regular 
algorithm  first,  since  lack  of  monotonicity  in  9*(y)  can  provide 
valuable  insight  into  the  structure  of  the  data. 

The  solution  functions  §*(y)  and  $*(x^ ), . . . ,$*(xp)  can  be  stored 
as  a  set  of  values  associated  with  each  observation  (y^,xk^ .... » 

1  <_k  <_N.  However,  since  0(y)  and  cj>( x)  are  usually  smooth  (for 
continuous  y,  x),-  they  can  be  easily  approximated  and  stored  as  cubic 
spline  functions  (deBoor  [1978])with  a  few  knots. 

As  a  tool  for  data  analysis,  the  ACE  procedure  provides  graphical 

output  to  indicate  a  need  for  transformations,  as  well  as  to  guide  in  their 

choice.  If  a  particular  plot  suggests  a  familiar  functional  form  for  a 

transformation,  it  can  be  substituted  for  the  empirical  transformation 

estimate  and  the  ACE  algorithm  can  be  rerun  using  an  option  which  alters 

only  the  scale  and  origin  of  that  particular  transformation.  The  resulting 
2 

e  can  be  compared  to  the  original  value.  We  have  found  that  the  plots 
themselves  often  give  surprising  new  insights  into  the  relationship  between 
the  response  and  predictor  variables. 

As  with  any  regression  procedure,  a  high  degree  of  association  between 
predictor  variables  can  sometimes  cause  the  individual  transformation 
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estimates  to  be  highly  variable  even  though  the  complete  model  is  reasonably 
stable.  When  this  is  suspected,  running  the  algorithm  on  randomly  selected 
subsets  of  the  data,  or  on  bootstrap  samples  (Efron  [1979])  can  assist  in 
assessing  the  variability. 

The  ACE  method  has  generality  beyond  that  exploited  here.  An  imme¬ 
diate  generalization  would  involve  multiple  response  variables  . . . . , Y  . 

The  generalized  algorithm  would  estimate  optimal  transformations 

*  *  *  * 

0-| ,. . .  ,6q,<J>i . .  ,d)p  that  minimize 

EE?.iVV  - 

subject  to  E0£  =  0,  £=!,...,  q,  E<J>j=0,j=l , . . .  ,p 
and  H^e^Y^II2  =  1  • 

This  extension  generalizes  the  ACE  procedure  in  a  sense  similar  to 
that  in  which  canonical  correlation  generalized  linear  regression. 

The  ACE  algorithm  (Section  2)  is  easily  modified  to  incorporate 
this  extension.  An  inner  loop  over  the  response  variables,  analagous 
to  that  for  the  predictor  variables,  replaces  the  single  function 


minimization. 
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5.0  Optimal  Transformations  in  Function  Space 
Introduction 

Define  random  variables  to  take  values  either  in  the  reals  or  in  a 
finite  or  countable  unordered  set.  Given  a  set  of  random  variables 
Y,Xj,...,Xp,  a  transformation  is  defined  by  a  set  of  real  valued 
measurable  functions  )  =  (0,£),  each  function  defined  on  the 

range  of  the  corresponding  random  variables,  such  that 

(5.1)  E0(Y)  =  0  ,  EcJ> J ( X j )  =  0  ,  j  =1 . . 

E02(Y)  <  «  ,  E*j(Xj)  <  -  ,  j=l,...,p 

Use  the  notation 

(5.2)  $(X)  =  f  ♦j(Xj) 

J 

Denote  the  set  of  all  transformations  by  F. 

(5.3)  DEFINITION.  A  transformation  (0*,$*)  is  optimal  for  regression 
if  E(9*)2  *  1,  and 

e*2  =  E[0*(Y)-$*(X)]2  =  inf  {E[0(Y)-*(X)]2;E92=1 } 

(5.4)  DEFINITION.  A  transformation  (9**,£**)  is  optimal  for 

correlation  if  E(0**)2  =  1,  E($**)2  =  1, 

p*  =  E[0**(Y)$**(X)]  =  sup  {E[0(Y)$(X)]2;  E($)2  = 1,E02=1 } 

(5.5)  THEOREM.  If  (9**,£**)  is  optimal  for  correlation ,  then  0*  =  0**, 
=  p*£**  is  optimal  for  regression  and  conversely .  Furthermore 
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PROOF.  Write 


E(e-$)2  =  1  -  E9$  +  E$2 

=  1  -  2E(0$!) /EqF"  +  E$2 


where  $  =  Hence 

(5.6)  E(0-$)2  >  1  -  2p*/E$r  +  E$2 


with  equality  only  if  E0$  =  p*.  The  minimum  of  the  right  side  of  (5.6) 
over  E<J>  is  at  E<j>  =  (p*)  where  it  is  equal  to  1  -  (p*)  .  Then 
(e*)  =  l-(p*)^  and  if  (©**,£**)  is  optimal  for  correlation,  then 
9*  =  e**,  =  p*£**  is  optimal  for  regression.  The  argument  is 

reversible. 


5.1  Existence  of  Optimal  Transformations 

To  show  existence  of  optimal  transformations,  two  additional 
assumptions  are  needed: 

AI.  There  is  no  non-zero  set  of  functions  satisfying  (5.1)  such  that 

0(Y)  +  Zj  9j(xj)  =  0  a. b. 

To  formulate  the  second  assumption,  define 

(5.7)  DEFINITION.  Define  the  Hilbert  spaces  HgfY)^^^) , . . .  ^(Xp) 
as  the  3et8  of  functions  satisfying  (5.1)  with  the  usual  inner  product , 
i.e.}  H«(X.)  is  the  set  of  all  measurable  .  such  that  E<t>.(X.)  =  0, 

^  J  J  J  J 

E4>j(Xj)  <  ®  with  (<J>j,<t>j)  =  E C<J> j ( X j ) q> j ( X j ) ] . 
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AII.  The  conditional  expectation  operators 

E(<tj(Xj)|Y):  H2(Xj)  —  H2(Y)  , 

EC^-CXjOIX.):  H2(Xj)  —  H2(X.)  ,  i  *  j 
E(8(Y)|X.):  H2(Y)  —  H^Xj) 

are  all  compact. 

Condition  All  is  satisfied  in  most  cases  of  interest.  A  sufficient 
condition  is  given  by:  let  X,  Y  be  random  variables  with  joint  density 
fx  y  and  marginals  fx,  fy.  Then  the  conditional  expectation  operator  on 
H2(Y)  -'♦'^(X)  1S  compact  if 

(5.8)  [fXy/fxfY]dxdy  <  » 

(5.9)  THEOREM.  Under  AI  and  All  optimal  transformations  exist. 

Some  machinery  is  needed. 

(5.10)  PROPOSITION.  The  set  of  all  functions  f  of  the  form 

f(Y,X)  *  9 ( Y )  +  Ij9j(Xj)  ,  eeH2(Y),  <t>j  G  h2 ( X j ) 
with  the  inner  product  and  norm 

(g.f)  =  E[gf]  ,  llfll2  =  Ef2  , 

is  a  Hilbert  space  denoted  by  H2>  The  subspace  of  all  functions  <j>  of 
the  form 

«*)  -  ZfyfXj)  ,  ♦jSH2(Xj) 

is  a  closed  linear  subspace  denoted  bu  H2(X).  So  are  H2(Y),H2(Xj),...,H2(X  ). 
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(5.10)  follows  from: 

(5.11)  PROPOSITION.  Under  AI,  All  there  are  constants  0  <  <_  C2  <  00 

such  that 

c1( gen2  +£!J||4>jii2)  i  1  c2( lien2  +l54jii2)  . 

PROOF.  The  right  hand  inequality  is  immediate.  If  the  left  side  does 

not  hold,  we  can  find  a  sequence  f  =9  +  7d)  such  that 

n  n  L  n,j 

H0nB2 +I^ll4>n| jH2  =  It  but  It f n II 2  — »- 0 .  There  is  a  subsequence  n'  such 
w  w 

that  en,  — ►  0,  — ►  d>j  in  the  sense  of  weak  convergence  in 

H2(Y),H2(X1) , . . . tH2(Xp)  respectively. 

Wri  te 


E£Vj<tyVl<V3  ■  E[Vj<*j>E(Vi<)!i>lXj>] 


to  see  that  All  implies  E4>n . j 4>n « ^  ■— ,  i  t  j  and  similarly  for 
E0n '  .  j  *  Furthermore  H<t>.jH  <  lim  M<i>n .  ^  ,  lien  <  lim  ll9nJ.  Thus, 

defining  f  =  0  +  £.<*>. 

J  J 

llfll2  =  110  +  1  Tjm  Ilfn,ll2  =  0 

which  implies,  by  AI,  that  0  =  01  =  --«  =4>p  =  0.  On  the  other  hand. 


Ilf  A2  = 
n 


3nJ2  +  Jfl4>n.j»2  +  pn-<fn',j)  +  .^Vj’V.i* 


Hence,  if  f  =  0,  then  Vim  Ilf  lr  1  1. 


(5.12)  COROLLARY.  If  f  f  in  H 
<J>nj  ^ j  in  H2(Xj),  jal . . 

PROOF.  If  f  =  0  +L*  .  0  +y.(j>. 

n  n  Lj  nj 

1  im  I*  .11  <  «.  Take  n'  such  that  9 
nj 


2,  then  9n  — 0 
and  conversely. 

,  then  by  (5.11), 


w 


in  H2(Y), 


1  Tin  11 9n II  < 

■  0'. ,  and  let 

J 
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f'  *  9'  +Ij4>j-  Then  for  any  g£H2,  (g,fn.) -*-(g' ,f'),  so  (g,f)  * 

(g.f1)  all  g.  The  converse  is  easier. 

(5.13)  DEFINITION.  In  H2j  let  Py,  P.  and  Px  denote  the  projection 
operators  on  H2(Y),  H2(Xj)  and  H2(X)  respectively. 

On  Hg ( X . ) ,  P.,  j  f  i,  is  the  conditional  expectation  operator,  and 

similarly  for  other  subspaces. 

(5.14)  PROPOSITION.  Py  is  compact  on  H2(X)  — »-H2(Y)  and  P^  is  compact 
on  H2(Y)  ->H2(X). 

PROOF.  Take  €  H2(X),  This  implies,  by  (5.12),  that 

4>nj  <i>j-  By  All,  Py4>n  ^  Py^.  so  that  Py^n  Py$.  Now  take 
eeH2(Y),  $  e  h2(X),  then  (0,Py$)  =  (e,<j>)  =  (Px9,<j>).  Thus, 

P^:  H2(Y)— >H2(X)  is  the  adjoint  of  Py  and  hence  compact. 

Now  to  complete  the  proof  of  Theorem  5.9.  Consider  the  functional 

-2  ~  ? 

119-4)1!  on  the  set  of  all  (0,<j>)  with  lien  =  1.  For  any  0,  $ 

B9-$B2  >  »0-Px0!l2 

If  there  is  a  8*  which  achieves  the  minimum  of  H 9-Px9 n 2  over  II0II2  =  1 

O 

then  an  optimal  transformation  is  0*,  Px©*.  On  II 0 II  =  1 

n 0-px@ii 2  =  i  -  »Px0ii2  . 

Let  i  =  {sup  HPx9l  ;  11011  =1).  Take  0p  such  that  H Gnll 2  =  1,  0n  0, 

and  IIPx0nl  — >i.  By  the  compactness  of  Px,  II px9nll  — *- II P XQ 1  =  s. 

Further,  1811  <_  1.  If  181  <  1,  then  for  0*  =  0/101,  we  get  the 
contradiction  IPX8'1  >  s.  Hence  101  =  1  and  (8,  Px8)  is  an  optimal 
transformation. 
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5.2  Characterization  of  Optimal  Transformations 

Define  two  operators  U:  ^(Y)  — ^(Y)  and  V:  H2( X)  — >- Hg( X)  by 

ue  =  pyPxe  ,  v$  =  PxPy$ 

(5.15)  PROPOSITION.  U  and  V  are  compact,  self-adjoint  and  non-negative 
definite.  They  have  the  same  eigenvalues  and  there  is  a  1-1  correspondence 
between  eigenspaces  for  a  given  eigenvalue  specified  by 

$  =  Px0/IIPx0ll  ,  0  =  Py$/IIPy$ll 

PROOF.  Direct  verification. 

Let  the  largest  eigenvalue  be  denoted  by  A,  A  =  Hull  =  II V II .  Then 

(5.16)  THEOREM.  If  Q*,  <j>*  is  an  optimal  transformation  for  regression, 
then 

xe*  =  ue*  ,  x$*  *  v$* 

Conversely,  if  0  satisfies  X8  =  U0,  II 6 II  =  1,  then  0,  Px©  is  optimal 
for  regression.  If  $  satisfies  A$  *  V$.,  then  0  =  Py<t>/HPy4>ll,  and 
A$/IIPy<i>ll  are  optima .  for  regression.  In  addition 

(e*)2  =  1  -  X  . 

PROOF.  Let  0*,  $*  be  optimal.  Then  0*  =  Px0*.  Write 

Il0*-$*ll2  =  1  -2(0*,$*)  +H$*H2 

Note  that  (0*,$*)  =  (0*,Py$*)  <  IIPy$*ll  with  equality  only  if 

0*  =  ePy$*,  e  constant.  Therefore,  8*  =  Py$*/IIPy$*ll .  This  implies 

HPy$*He*  =  ue*  ,  IIPy$*!l$*  =  V$*  , 


so  that  HPy<i>*l  is  an  eigenvalue  X*  of  U,  V.  Computing  gives 
80*-<j>*ll  =  1  -  X*.  Now  take  e  any  eigenfunction  of  U  corresponding  to 

X,  with  II 61  =  1.  Let  $  -  Px0,  then  ||0-<j>ll  =  1-X.  This  shows  that 
9*,  are  not  optimal  unless  X*  =  X.  The  rest  of  the  theorem  is 
straightforward  verification. 

(5.17)  COROLLARY.  If  X  has  multiplicity  one ,  then  the  optimal  trans¬ 
formation  is  unique  up  to  a  sign  change.  In  any  case,  the  set  of  optimal 
transformations  is  finite  dimensional. 

It  appears  that  uniqueness  is  the  general  case. 


5.3  Alternating  Conditional  Methods 

Direct  solution  of  the  equations  X0  =  U0  or  X<j>  =  V<ji  is 

formidable.  Attempting  to  use  data  to  directly  estimate  the  solutions 

is  just  as  difficult.  In  the  bivariate  case,  if  X,  Y  are  categorical, 

then  X0  =  U9  becomes  a  matrix  eigenvalue  problem  and  is  tractable.  This 

is  the  case  treated  in  Kendall  and  Stuart  [1967]. 

The  ACE  algorithm  is  founded  on  the  observation  that  there  is  an 

iterative  method  for  finding  optimal  transformations.  We  illustrate  this 

2 

in  the  bivariate  case.  The  goal  is  to  minimize  !1 9 ( Y ) -<) ( X ) II  with 

II 0 II  ^  s  1.  Denote  P^0  =  E(0|X),  Py4>  =  E(<j>|Y).  Start  with  any  first 
guess  function  0g(Y).  Then  define  a  sequence  of  functions  by 
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and  in  general  ^n+1  =  PX9n’  9n+l  =  PY4>n+l/IIPY4>n+l“  •  It  is  clear  that 

2  . 

at  each  step  in  the  iteration  16-^11  is  decreased.  It  is  not  hard  to 
show  that  in  general,  0n>  <j>  converge  to  an  optimal  transformation. 

The  above  method  of  alternating  conditionals  extends  to  the  general 
multivariate  case.  The  analogue  is  clear;  given  0n,  <j>  ,  then  the  next 
iteration  is 

Vl  =  PX9n  ’  9n+l  =  PY^n+l/IPYVll 

However,  there  is  an  additional  issue:  How  can  Px@  be  computed  using 
only  the  conditional  expectation  operators  P.,  j  =l,...,p?  This  is  done 

J 

by  starting  with  some  function  <}>q  and  iteratively  subtracting  off  the 

projections  of  0-<j>n  on  the  subspaces  HgfX^) . H2(Xp)  unt^  we  9et  a 

function  $  such  that  the  projection  of  0-$  on  each  of  H2(X.)  is  zero. 
This  leads  to 

The  Double  Loop  Algorithm 
The  Outer  Loop 

1.  Start  with  an  initial  guess  0g(Y). 

2.  Put  0n+2  =  Pv^n+l^Y^n+i1'  and  rePeat  until 

convergence. 

Let  P^0g  be  the  projection  of  0g  on  the  eigenspace  E  of  U 
corresponding  to  X.  Then 

(5.18)  THEOREM.  If  II P^0q II  /  0,  define  an  optimal  transformation  by 
e*  =  P £ 0q/ II P e®o  1  j  4>*  =  pxe0‘  llen"e*!l  “*0. 

PROOF.  Notice  that  0p+j  =  U0n/ ttU0n!l .  For  any  n,  ©n  =  an0*+gn,  where 
g  IE.  Because,  if  it  is  true  for  n,  then 
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6n+l  =  (Ve*+u9n)/8anX0*+Ugnll 

and  Ugn  is  1  to  E.  For  any  glE,  llUgll  <  rllgl  where  r  <  X 

Vl  ■  iV,U6nl>  Vl  ’  V®.1'  then 

'®n+ll/an+l  '  "V^n  S  < ) Hgn"/o.„  . 

Thus  l9nl/an  <_  c(r/X)n.  But  B9nl  =  1,  a^+SgnD2s  1,  implying 
2 

— ►  1 .  Since  aQ  >  0,  then  ctn  >  0,  so  an-*l.  Now  use  80n- 

2  2  ^  ^ . 

(l-an)  +»gnll  to  reach  the  conclusion.  Since  1  *  ilpx0n 

£  80n-0*ll,  the  theorem  follows. 

The  Inner  Loop 

1.  Start  with  functions  0,  <j>Q. 

2.  If,  after  m  stages  of  iteration,  the  functions  are 
define,  for  j  *  1,2, . . . ,p, 

♦(;+1>  -  p,(8  - 1  Am)  - 1 

J  J  i>j  J  i<j 

(5.19)  THEOREM.  Let  =  1$*^  •  ^X0-^11  ”^°* 

PROOF.  Define  the  operator  T  by 

T=  (l-Pp)(I-Pp.1,---(I-P1) 

Then  the  iteration  in  the  inner  loop  is  expressed  as 

(5.20)  = 

*  t”(0-9o) 

Write  0-0o  =  0  -  Px@  +  P^0  -  $q.  Noting  that  T(0-Px0)  =  0-Px0, 
becomes 

Vi  =  px9  +Tm(Px6-^o) 


.  Since 


*i  = 
pxe*i 


,  then 


(5.20) 
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The  theorem  is  then  proven  by 

(5.21)  PROPOSITION.  For  any  ^HjOt),  H  A«  -*-0. 

PROOF.  II ( I-P  . )$l! 2  =  ll$ll2  -  IIP  .$ll2  £  ll$ll2.  Thus  II Til  <1.  There  is  no 
J  J 

$  f  0  such  that  IT$li  =  II $11 .  If  there  were,  then  IPj$l  =0,  all  j. 

Then  for  $'  * 

($.$')  =  I  j  ( $  )  s  3  0 

J 

The  operator  T  =  I  +W,  where  W  is  compact.  Now  we  claim  that 
||TmWII  — *0  on  H2(X).  To  prove  this,  let  0  <  y  <  1  and  define 

G(y)  =  sup  {IITW$II/IW$1 ;  !$«  <1,  IIW$II  >y}  • 

$ 

Take  8<j>nB  £  1,  (W$nII  £  y  so  that  ITW<f>nB/BW<j>nl  -*G(y) .  Then 

|$l  <1,  ||W$il  >  y  and  G(y)  *  ITW$I/IW$I.  Thus  G(y)  <  1,  for  all 

0  <  y  <  1  and  is  clearly  non- increasing  in  y.  Then 

||TmW$ll  =  ITWTm'1$ll  £  G(BTm"1W$»)llTn1"1W$ll  . 

Put  yq  ~  1 WB ,  Ym  3  G(Yin_1)Yn-1,  then  lAl<Ym.  But  clearly  Ym— 0. 

The  range  of  W  is  dense  in  HgC X) -  Otherwise,  there  is  a  $'  f  0 
such  that  ($',W$)  =  0,  all  $.  This  implies  (W*$',$)  =  0  or  W*<j>’  =  0. 
Then  II T*$ ' II  =  8$' II  and  a  repetition  of  the  argument  given  above  leads 
to  <j>'  =  0.  For  any  $  and  t  >  0,  take  so  that  il$-W<j>jll  <_  e. 

Then  BTm$*  £  e+BT^Wj^,  which  completes  the  proof. 

There  are  two  versions  of  the  double  loop.  In  the  first,  the  initial 
functions  $q  are  the  limiting  functions  produced  by  the  preceding 
inner  loop.  This  is  called  the  restart  version.  In  the  second,  the 
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initial  functions  are  4>q  =0.  This  is  the  fresh  start  version.  The 
main  theoretical  difference  is  that  a  stronger  consistency  result  holds 
for  fresh  start.  Restart  is  a  faster  running  algorithm,  and  is  embodied 
in  the  ACE  code. 


The  Single  Loop  Algorithm 


The  single  loop  algorithm  combines  a  single  iteration  of  the  inner 
loop  with  an  iteration  of  the  outer  loop.  Thus,  it  is  summarized  by 


1.  Start  with  9g,  =  0. 

2.  If  the  current  functions  are  @n,  $  ,  define  by 


9n'Vl  =  T(W 

3.  Let  8n+1  =  PY^n+i/^PY^n+i'*  •  Run  t0  convergence. 

This  is  a  cleaner  algorithm  than  the  double  loop  and  its  implementa¬ 
tion  on  data  runs  at  least  twice  as  fast  as  the  double  loop  and  requires 
only  a  single  convergence  test.  Unfortunately,  we  have  been  unable  to 
prove  that  it  converges  in  function  space.  Assuming  convergence,  it  can 
be  shown  that  the  limiting  8  is  an  eigenfunction  of  U.  But  giving  condi¬ 
tions  for  8  to  correspond  to  X  or  even  showing  that  8  will  correspond 
to  X  "almost  always"  seems  difficult. 
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6.0  The  ACE  Algorithm  on  Finite  Data  Sets 
Introduction 

The  ACE  algorithm  is  implemented  on  finite  data  sets  by  replacing 
conditional  expectations,  given  continuous  variables,  by  data  smooths. 

In  looking  at  the  convergence  and  consistency  properties  of  the  ACE 
algorithm,  the  critical  element  was  the  properties  of  the  data  smooth 
used.  The  results  are  fragmentary.  Convergence  of  the  algorithm  is  proven 
only  for  a  very  restricted  class  of  smooths.  In  practice,  in  over  1000 
runs  of  ACE  over  a  wide  variety  of  data  sets  and  using  three  different 
types  of  smooths,  we  have  seen  only  one  instance  of  failure  to  converge. 

A  fairly  general,  but  weak,  consistency  proof  is  given.  We  conjecture  the 
form  of  a  stronger  consistency  result. 


6.1  Data  Smooths 

Define  a  data  set  D  to  be  a  set  {x^,...,x^}  of  N  points  in  p 
dimensional  space,  i.e.  x^  =  (x^, . . .  ,xkp)  •  Let  be  the  collection 
of  all  such  data  sets.  For  fixed  D,  define  F(x)  as  the  space  of  all 
real-valued  functions  <t>  defined  on  D,  i.e.  <J>  e  F(x)  is  defined  by 
the  N  real  numbers  (<b(x1),. .  .,d>(xN)>.  Define  F  ( x  ^ ) ,  j«l,...,p  as 
the  space  of  all  real-valued  functions  defined  on  the  set  j *x2j ’ *  *  *  *xn j ^ * 


(6.1)  DEFINITION.  A  data  smooth  S  of  x  on  x.  is  a  mapping 

J 

S:  F(x)  — ►F(x.)  defined  for  every  D  in  V  .  If  <p  €  F(x)  denote  the 

J  \  I 

corresponding  element  in  F(xj)  by  S(<J>|Xj)  and  its  values  by  S ( 0  ( x^ j ) 


Let  x  be  any  one  of  Xj,...,x  .  Some  examples  of  data  smooths  are 


1.  Histogram:  Divide  the  real  axis  up  into  disjoint  intervals  0^}.  If 
xk  e  1^,  define 


S(<t|xk)  = 


4  -m’ 


l 

xei 


i 


2.  Moving  Average:  Fix  M  <  N/2.  Order  the  xk  getting  x^  <  •••  <xN 
(assume  no  ties),  and  corresponding  <|>(xj) , . . .  ,<j>(x  ) .  Put 

S(<f>Uk)  =  ^  JQI!M 

If  M  points  are  not  available  on  one  side,  make  up  the  deficiency  on  the 
other  side. 


3.  Kernel :  Take  K(x)  defined  on  the  reals  with  maximum  at  x  =  0.  Then 

S<*l*k>  ■I*<*1>«Vxk)/I  K<Vxk> 

m  sc 

4.  Regression:  Fix  M  and  order  xk  as  in  (2)  above.  At  xk>  regress 
the  values  of  $(xk_M). • . .,0(xk+M)  excluding  <}>(xk)  on  xk_M* •  •  *  »xk+M 
excluding  xk,  getting  a  regression  line  L(x).  Put  S(o|xk)  *  L(xk). 

If  M  points  are  not  available  on  each  side  of  xk  make  up  the  deficiency 
on  the  other  side. 


5.  Supersmoother:  See  Friedman  and  Stuetzle  [1982]. 

Some  properties  that  are  relevant  to  the  behavior  of  smoothers  are  given 
below.  These  properties  hold  only  if  they  are  true  for  all  D  € 

Linearity.  A  smooth  is  linear  if 

S(a<J)j  +  S<t>2 )  =  aS$j  +  SScbg 

for  all  $2  €  F(x)  and  constants  a,  3. 
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Constant  Preserving.  If  <fi  £  F(x)  is  constant,  <j>  =  c,  then  S<j>  =  c. 

To  give  a  further  property,  introduce  the  inner  product  (  )N  on 
F(x)  defined  by 

( d5 » d5  ”  n"  (xk) 

and  the  corresponding  norm  II  II 

Boundedness.  S  is  bounded  by  M  if 

IS<f»lN  <  Ml4»HN  ,  all  <p  e  F(x) 

In  the  examples  of  smooths  given  above,  all  are  linear,  except 
supersmoother.  This  implies  they  can  be  represented  as  an  N*N  matrix 
operator  varying  with  D.  All  are  constant  preserving.  Histograms  and 
moving  average  are  bounded  by  one.  Regression  is  unbounded  due  to  end 
effects,  but  in  the  appendix  we  introduce  a  modified  regression  smooth 
that  is  bounded  by  2.  Supersmoother  is  bounded  by  2.  The  bound  for 
kernel  smooths  is  more  complicated. 


6.2  Convergence  of  ACE 

Let  the  data  be  of  the  form  (yk,xk)  =  (yk>xki» * •  •  »xkp)  >  k  =  l,...,N. 

Assume  tha\.  y  =  x^  =  ---  =  xp  =  0.  Define  smooths  . Sp  where 

S i  :  F(y,x)  — ►  F(y )  and  S ^ :  F(y,x)  — ►  F(x. ) .  Let  H2(y,x)  be  the  set  of 
all  functions  in  F(y,x)  with  zero  mean  and  H2(y),  H2(xj)  the  corres¬ 
ponding  subspaces. 

It  is  essential  to  modify  the  smooths  so  that  the  resulting  functions 
have  zero  means.  This  is  done  by  subtracting  the  mean;  thus  the  modified 
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3 


S.  is  defined  by 

J 


(6.2) 


Sjd>  =  Sj<j>  -  s j<j> 


Henceforth,  we  use  only  modified  smooths  and  assume  the  original  smooth 
to  be  constant  preserving,  so  that  the  modified  smooths  take  constants 
into  zero. 

The  ACE  algorithm  is  defined  by 

1.  9(0>(yk).yk,  ^°>(xkj)iO. 

The  Inner  Loop 

2.  At  the  n  stage  of  the  outer  loop,  start  with  0^,4>^. 

J 

For  every  m  >_  1  and  j  *1 . p  define 

♦l1"*11  •  s,(e(n>  -  l  -  I  *<m>) 

J  J  i<j  i>j 


Keep  increasing  m  until  convergence  to  <J>.. 

J 


The  Outer  Loop 

3.  Set  6^n+1^  =  S  (L<f.)/IS  (y.<j>.)lN,  go  back  to  the  inner  loop 

J  J  J  J  J  J 

with  =  <j> . .  Continue  until  convergence. 

J  J 

r 

To  formalize  this  algorithm,  introduce  the  space  H2( 0 ,4>)  with 
elements  (0,4^, . . . ,4>p) ,  9GH2(y),  ancJ  subspaces  H2 ( 8 ) 

with  elements  (0,0,0, ... ,0)  =  0  and  H2($)  with  elements  (0,$j,...,$  ) 
=  $. 

For  f-  (fo»fi,...,fp)  in  H2(0,$)  define  S^:  H2(9,$)  — *-H2(0,$) 

by 


(y>i 


0  , 


j  t  i 


f,  +S .(  I  f.)  ,  j  =i 
J  J  if  j 


! 
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Starting  with  0  =  (0,0,0, ...  ,0) ,  =  (0,4>[m) , . . .  )  one  complete 

cycle  in  the  inner  loop  is  described  by 

(6.3)  §-${m+1)  =  (I-Sp)(I-Sp_1)..-(I-S1)(6-$(m))  . 

Define  T  on  — ►H2(9»$)  as  the  product  operator  in  (6.3).  Then 

(6.4)  =  0  -Tm(0-^O))  . 

If,  for  a  given  0,  the  inner  loop  converges,  then  the  limiting  <j> 
satisfies 

(6.5a)  Sj(0-$)  =  0  ,  j  =  1 . p  . 

That  is,  the  smooth  of  the  residuals  on  any  predictor  variable  is  zero. 
Adding 

(6.5b)  0  *  Sy$/IISy$HN 

to  (6.5a)  gives  a  set  of  equations  satisfied  by  the  estimated  optimal 
transformations. 

Assume,  for  the  remainder  of  this  section,  that  the  smooths  are 
linear.  Then  (6.5a)  can  be  written  as 

(6.6)  Sj4>  =  Sj9  ,  j  = 1, . . . ,p  . 

Let  sp(S.)  denote  the  spectrum  of  the  matrix  S..  Assume  1  ^  sp(S-). 

v  J  J 

(For  constant  preserving  smooths  1  is  always  in  the  spectrum,  but  not 
for  modified  smooths.)  Define  matrices  A.  by  A.  =  S.(I-S.)~*  and 

J  J  J  J 

the  matrix  A  as  LA..  Assume  further  that  -1  ^  sp(A).  Then  (6.6) 

J  J 

has  the  unique  solution 

=  Aj(  I+A)_10  ,  j  =l,...,p 


(6.7) 
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The  element  $  =  (0,$j,...,$  )  given  by  (6.7)  will  be  denoted  by  PS. 
Rewrite  (6.3)  using  ( I-T) (8-P0)  =  0  as 

(6.8)  ^  =  pe  -ftpe-j^) 


Am 

Therefore,  the  inner  loop  converges  if  it  can  be  shown  that  T  f— *-0  for 
all  f  e  Hg( d>) -  What  we  can  show  is 

(6.9)  THEOREM.  If  det[I+A]  f  0  and  if  tine  spectral  radii  of  S^,...,S 
are  all  lees  than  one ,  a  necessary  and  sufficient  condition  for  rv  — .o 
for  all  f  £  Hg^)  is  that 

P  1 

(6.10)  det[Xl -n(I-S.A)  (I-S .)! 

^  J  J 

have  no  zeroes  in  |X|  >_  1  except  X  =  1. 


PROOF.  For  ^f— *>0,  all  f  €H2($),  it  is  necessary  and  sufficient 
that  the  spectral  radius  of  T  be  less  than  one.  The  equation  Tf  =  Xf 
in  component  form  is 


(6.11) 


Xf  • 


Fi  =  -M*  l  +  l  V  *  j  . p  • 

J  J  i<j  1  i>j  1 
Let  s  =  £f.  and  rewrite  (6.11)  as 


(6.12)  (Xl-S.)f.  =  S.((l-X)  l  f.  -s)  . 

j  j  j  i<j 


If  X  =  1,  (6.12)  becomes  (I-S-)f. 

J  J 

this  implies  s  =  0,  and  hence  f. 

J 

as  an  eigenvalue  of  T.  For  X  f  1, 
the  spectral  radii  of  the  S.,  j  =1 

J 

Then  fj  =  (9j+1-9j)/( 1-X) ,  so 


=  -S.s  or  s  =  -As.  By  assumption, 

J 

=  0,  all  j.  This  rules  out  X  =  1 

but  X  greater  than  the  maximum  of 

,...,p,  define  g.  =  (1-X)  \  f.-s. 

J  i<j 


(iI-Sj)(9j+r9j>  •  ('-‘ISjSj 
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(6.13) 


9j+l  '  C-VX)  (I-SJ)9J  • 


Since  gp+1  =  -Xs,  =  -s,  then  (6.13)  leads  to 

(6.14)  Xs  =  (I-Sp/X)"1(I-Sp)---(I-S1/X)_1(I-S1)s 

If  (6.14)  has  no  non-zero  solutions,  then  s  =  0,  g.  =0,  j  =1 . p, 

J 

implying  all  f •  =  0.  Conversely,  if  (6.14)  has  a  solution  s  f  0,  it 

J 

leads  to  a  solution  of  (6.11). 

Unfortunately,  condition  (6.10)  is  difficult  to  verify  for  general 
linear  smooths.  If  the  S-  are  self-adjoint,  non-negative  definite, 

J 

such  that  all  elements  in  the  unmodified  smooth  matrix  are  non-negative, 
then  all  spectral  radii  of  S.  are  less  than  one,  and  (6.10)  can  be  shown 

J 

to  hold  by  verifying  that 


|x|  <  ni(i-s./x)_1(i-s.)i 

i  j  j 


has  no  solutions  X  with  | X j  >  1,  and  then  ruling  out  solutions  with 
[X |  =  1. 

The  only  common  type  of  smooth  satisfying  the  above  conditions  is 
the  histogram  smooth,  a  poor  smooth  to  use  in  implementing  ACE. 

A 

Assuming  that  the  inner  loop  converges  to  P9  ,  then  the  outer  loop 


iteration  is  given  by 


c  pa ( n ) 

6(n+1)  ■ 

"y§  >N 


Put  the  matrix  S^P  =  U,  so  that 
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(6.15) 


,(n+l)  _  Oe' 


ntJ0(n)iL 


If  the  eigenvalue  X  of  U  having  largest  absolute  value  is  real 
and  positive,  then  e^n+^  converges  to  the  projection  of  0^  on  the 
eigenspace  of  X.  The  limiting  0,  P0  is  a  solution  of  (6.5a,b). 
However,  if  X  is  not  real  and  positive,  then  0^  oscillates  and  does 
not  converge.  If  the  smooths  are  self-adjoint  and  non-negative  definite, 

/V 

then  SyP  is  the  product  of  two  self-adjoint  non-negative  definite 
matrices,  hence  has  only  real  non-negative  eigenvalues.  We  are  unable  to 
find  conditions  guaranteeing  this  for  more  general  smooths. 

Thus,  in  spite  of  the  fact  that  ACE  has  invariably  converged  (with 
one  exception)  we  cannot  give  a  general  convergence  proof.  We  conjecture 
that  convergence  holds  only  for  "most"  data  sets  in  a  sense  made  explicit 
in  the  following  section. 


6.3  Consistency  of  ACE 


For  <|>q,<J>i, . . . ,<bp  any  functions  in  Hg(Y) ,H2(Xj) , 
any  data  set  D  e  define  functions  P ^ (d>^  I x j )  by 


..H2(Xp),  and 


(6.16) 


W\|>  ■  EIWW 


Let  <t>..  in  H2(xj)  be  defined  as  the  restriction  of  <J>i  to  the  set  of 
data  values  {xjj,...,x^.}  minus  its  mean  value  over  the  data  values. 

Assume  that  the  N  data  vectors  (y^.x^)  are  independent  samples 
from  the  distribution  of  (Y,Xj, . . . ,X  ) . 


*3 

11 
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(6.17)  DEFINITION.  Let  S^j  be  any  sequence  of  data  smooths.  They 

■J  J 

are  mean  square  consistent  if 


:(N) 


EHS  }"'(  l  *.|x  )-  l  P  U\x  )\\^  0 
J  i*j  J  i*j  J  1  J  M 


ft 


for  all  4>g*  -  •  •  »<f>_  <za  above,  with  the  analogous  definition  for  S 


(N) 


The  m.s.  consistency  of  some  smooths  is  discussed  in  the  next  section. 

Whether  cr  not  the  algorithm  converges,  a  weak  consistency  result 
can  be  given  under  general  conditions.  Start  with  eQ  €  H2(Y).  On  each 
data  set,  run  the  inner  loop  iteration  m  times,  that  is,  define 


♦(nH>  = 


-m 


Then  set 


9(n+l)  _  p  ^(n+l)^jp  ( n+1) 

-m  y  -m  y^m 


N 


Repeat  the  outer  loop  l  times  getting  the  final  functions  eN(y;m,£), 
<{ljN^xj’m,il^ •  130  the  anal°90us  thing  in  function  space  starting  with  eQ, 

getting  functions  whose  restriction  to  the  data  set  D  are  denoted  by 
9(y;m,il),  .  Then 

(6,18)  THEOREM.  If  the  smooths  are  m.s.  consistent  and 

J  J 

uniformly  bounded  as  N  — then 


Ell0Hj(yini,Z)-0(y;  ,m, £) B ^  — ►O  ,  EB cbj N( x^  ;m, JZ. ) ( x^ ;m, 2.)  II ^  — ►O 

If  9*  is  the  optimal  transformation  P^9g/IP^9gll,  0*  =  P^9*,  then  as 
m,  l  — *■«  in  any  way ,  for  the  fresh  start  algorithm 


1 9(  •  ;m, 0-9*1  —0 


UA  •  ;m,2.)-otl  0  . 
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PROOF.  First  note  that  for  any  product  of  smooths  sj^*-*sj^. 


s[.N)--*s_(N)e„  -P,  —P,  eAl  —  0  . 


ijl  ®  ril  rV0^ 


This  is  illustrated  with  sjN^S^0o,  i /j.  Since  e|s^. ©Q-P^eQl ^  —  0, 
then  sj^9q  =  Pj0o +^j  N  w^ere  n^N- There^ore 

S(N)(S(N)  ,  =  s(N)p  q  +  ,(N). 

Si  1  j  V  Si  Pj80  +  Si  \j  ,N 

By  assumption  fl S ^ N ^ ^  ^  _<  Mil ^1^,  where  M  does  not  depend  on  N. 
Therefore  Els5N)<Dj(  nHn— 0-  By  assumption  e|| N) Pj.60-Pi P^egl ^  — 0  so 
that  Elsj^S^eQ-P.P.eJI^o. 


(6.19)  PROPOSITION.  If  8^  is  defined  in  H2(y)  for  all  data  sets  D, 
and  0  €  ^(Y)  such  that 


i0N(y) 


.eixiJj 

Tumi 


l||0NllN  II 011  UN 


PROOF.  Write  0/11011  =  0/1011^  +  8 ( 1  / II 0 II  -1/1101^).  So  two  parts  are 
needed.  First,  to  show  that 


Elie^'ra^N-*°  • 

Second,  that  El8(jp- - )||^  — >-0.  For  the  first  part,  let 

P2  _  1  rr0N(V  e(yk\2  o n  (9N*9)N  v 
N  "  N  ^Te^"  II 911 N  "  U'll6NllNll9llNj  * 


2  2  P  2 

Then  S^  <  4,  so  it  is  enough  to  show  that  SN — >-0  to  get  ES^  — < ►O. 
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Let 

VN  =  I(eN^k)-0^k^2 

=  «eN!l^  +  Hell2  .  2(eN,9)N 
=  (!i9N8-»eiiN)2  +  2(«eiNiieNnN-(eN,9)N) 

2  2 

Both  terms  are  positive,  and  since  EV^— *0,  — *0, 

i  2  2 1 

E(ll8llNll0NllN-(eN,e)N)  — >- 0 .  By  the  law  of  large  numbers  Ej  BeB^-leH  j  — *-0, 
2  P 

resulting  in  SN — >-0. 

Now  look  at 
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as  m,  l  go  to  infinity.  Begin  with 

(6.20)  PROPOSITION.  As  m  — *■«,  Um  — ►U  in  the  uniform  operator  norm. 

PROOF.  Iium9-ue«  =  IIPYTmPx6»  <  iAxetl.  Nowon  H2(Y),  BTmPxR  -*0.  If 
not,  take  enj,  l6mB  =  1  such  that  ifp^l  >  fi,  all  m.  Let  em,  JUq, 
then  px0m“^px9»  and 

B T171 ' pxem  1 "  i  »Tffl,px(0m,-9)J  +  »fn,px9ll 
-  1FV6m'‘9^  +  ^’px911 

By  Proposition  (5.21)  the  right  hand  side  goes  to  zero. 


The  operator  Um  is  not  necessarily  self-adjoint,  but  it  is 
compact.  By  (6.18),  if  0(sp(U))  is  any  open  set  containing  sp(U), 
then  for  m  sufficiently  large  sp(Um)  C  0(sp(U) ) .  Suppose,  for  sim¬ 
plicity,  that  the  projection  Ex  corresponding  to  the  largest  eigenvalue 
X  of  U  is  one-dimensional.  (The  proof  goes  through  if  Ex  is  higher¬ 
dimensional  but  is  more  complicated.)  Then  for  any  open  neighborhood  0 

of  X,  and  m  sufficiently  large,  there  is  only  one  eigenvalue  X  of 

m 

Um  in  °*  Xm~^X  and  the  projection  P^m^  of  Um  corresponding  to 
Xfn  converges  to  P£  in  the  uniform  operator  topology.  Also,  Xm  can 
be  taken  as  the  eigenvalue  of  Um  having  largest  absolute  value.  If 
X*  is  the  second  largest  eigenvalue  of  U,  and  X^  the  eigenvalue  of 
Um  having  the  second  highest  absolute  value,  then  (assuming  E, ■  is 

A 

one-dimensional)  X' — *-X'. 

m 

Write 


W  =  U  -  p[m> 
m  m  E 


W  =  U  -Pr 


so  again  HW^-Wli  -+0.  Now 
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(6.21) 


<eo  -  x>e">6o  *  <8o 

•  X‘PE  >  + 

A 


For  any  e  >  0  we  will  show  that  there  exists  itIq,  2,q  such 


m  >_  mg.  £  _>  Jig 


(6.22) 


1w*qo!|X*<  e  ,  lw\l|XA  <  e 


Take  r  =  (X+X')/2  and  select  iHq  such  that  r  >  max(X',|Xr 
Denote  by  R(X,W  )  the  resolvent  of  W  .  Then 


and 


wm  2  m 


|X|=r 


X  R(X,WjdX 


nr 


I1 <  -X 

m  —  2ir 


I  R(  X  »W  )  I  d  |  X  | 

|X|=r 


where  d|x|  is  arc  length  along  jx|  *  r.  On  |X|  *  r,  for 

II R  ( X ,  W  )ll  is  continuous  and  bounded.  Furthermore  flR(X,W)l 
m  m 

uniformly.  Letting  M(r)  =  max  II R( X ,W) 8 ,  then 

I X  |  =r 

iw£l  <  r4M(r)(l+6JL1) 

m  —  mm' 


where  AS  — *0  as  m  — *■«.  Certainly 
m  m 


flW£I  <  r^M(r)  . 


Fix  6  >  0  such  that  (l+6)r  <  X.  Take  such  that  for 
m  >  max(mQ,mQ),  Xm  _>  (l+6)r.  Then 

and 

IIW^II/X11  <  (^)£M(r)  . 


that  for 


m  >  mQ, 

— »-  IR(X,W)I 
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Now  choose  a  new  mQ  and  ZQ  such  that  (5.6)  is  satisfied. 
Using  (5.6) 


ufco 

it£eni 

m  u 


pe> 

m 

n’Po1 

m 


=  e, 


m,£ 


where  c 


m,£ 


■0  as  m,£  — *■».  Thus 


IUm60' 


Pc  6n 

=  e-  + 1 Eni_? — 

Em,£  I BPe  90l 

m 


PEx60 


and  the  right  side  goes  to  zero  as  m, £—*•». 

The  term  weak  consistency  is  used  above  because  we  have  in  mind  a 
desirable  stronger  result.  We  conjecture  that  for  reasonable  smooths,  the 
set  CN  =  ((Yj.Xj)  , . . .  ,(Yn,Xn);  algorithm  converges)  satisfies  P(CN)— ^1 
and  that  for  the  limit  on  CN  starting  from  a  fixed  eQ, 

Er'c^V6*'^-0  • 

We  also  conjecture  that  such  a  theorem  will  be  difficult  to  prove. 


APPENDIX  1 


Most  Reasonable  Sequences  of  Uniformly  Bounded  Smooths  are  M.S.  Consistent 

If  the  window  size  goes  to  zero  at  the  right  rate  as  N  — *•<»,  then 
most  "reasonable”  smooths  which  utilize  local  smoothing  are  m.s.  consis¬ 
tent.  There  is  a  substantial  literature  on  consistency,  usually  in 
higher  dimensional  spaces.  Stone's  pioneering  paper  [1977]  established 
consistency  for  k-nearest  neighbor  smoothing.  Devroye  and  Wagner  [1980] 
and  independently  Spiegelman  and  Sacks  [1980]  gave  weak  conditions  for 
consistency  of  kernel  smooths.  See  Stone  [1977]  and  Devroye  [1981]  for  a 
review  of  the  literature. 

The  common  definition  of  consistency  is:  given  a  set  of  N-l  inde¬ 
pendent  copies  (Xj,Yj),...,(X|y_j,Ynj_j),  of  (X,Y)  drawn  from  the  same 
bivariate  distribution,  and  <j>€L2(Y),  call  S^  l_2-consi stent  if 
E[S^N^  (X)~E(<f>(Y)  jX)]2  — *  0.  To  see  that  our  definition  is  equivalent,  put 
down  the  fixed  point  (x,y)  and  then  the  other  N-l  random  points 
(x1,y1),...,(xN_1,yN_1).  Now  compute  E[sJ|N^(x)-E(Yjx)]2  =  gN(x).  Our 
definition  of  m.s.  consistency  is  then 

k  W]  J  0 

or  EgN( X)  — ►O. 

Uniform  boundedness  is  a  critical  condition  for  consistency  proofs. 

A  key  element  in  Stone's  proof  is  (put  in  different  form) 

(  A.l  )  PROPOSITION.  Take  data  sets  drawn  from  a  bivariate  distribution 
(N) 

(X,Y),  Sv  '  a  uniformly  bounded  sequence  of  smooths  on  x,  P^  the 
conditional  expectation  operator.  If,  for  a  set  of  functions  {<J>} 
dense  in  H2(Y), 
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e|s(n)U|x)-px(<i>|x)Sn  —  0  , 

( N) 

then  the  S'  '  as  m.s.  consistent.  If,  for  a  set  of  functions  {h } 
dense  in  ^(X), 

(A. 2  )  Eis(N)(h|x)-h(x)||^  —  0 

then  (A. 2  )  holds  for  all  h  £  ^(X). 

The  proof  is  simple  and  is  omitted. 

( Nl 

Assume  S  '  is  linear.  Then 

(A. 3  )  E8s(NVpx<MJ;  1  2EIS(N)(<f>-Px<|>)lljj  +  2EHS(N)Px<j>-Px<|>ll^  . 

If  it  can  be  shown  that  E||S^N^h-hl^  — ►O  for  all  continuous  h  £  ^(X) 

vanishing  off  at  finite  intervals,  and  if  the  first  term  on  the  right 

in  (A. 3  )  goes  to  zero  for  all  <j>  such  that  <  00 ,  then  (A.l  ) 

( Nl 

implies  that  S'  '  is  m.s.  consistent.  This  strategy  works  for  a  wide 
variety  of  smooths. 

To  illustrate,  because  Stone's  results  [1977]  do  not  seem  immediately 
applicable  to  bivariate  regression  smooths,  m.s.  consistency  is  proven 
for  a  modified  regression  smooth  similar  to  supersmoother.  For  x  any 
point,  let  J(x)  be  the  indices  of  the  M  points  in  (x^l  directly 
above  x  plus  the  M  below.  If  there  are  only  M'  <  M  above  (below) 
then  include  the  M+(M-M')  directly  below  (above).  For  a  regression 
smooth 

(  A. 4  )  S($|x)  =  ^ — <x-\} 

ax 

where  $  ,  xx  are  the  averages  of  <J>(y^),  xk  over  the  indices  in  J(x), 
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2 

rx(<M),  ox  the  covariance  between  4>(yk),  xk  and  the  variance  of  xk 
over  the  indices  in  0(x). 

Write  the  second  term  in  (A. 3  )  as 

rx(<M)  (x-xx) 


If  there  are  M  points  above  and  below  in  J(x),  it  is  not  hard  to  show 


that 


<  1. 


This  is  not  true  near  endpoints  where  (x-xx)/ax  can  become  arbitrarily 
large  as  M  gets  large.  This  endpoint  behavior  keeps  regression  from 
being  uniformly  bounded.  To  remedy  this,  define  a  function 


[x] 


x  , 


xl  <  1 


1  1  sgn(s)  ,  |x|  >  1  , 


and  define  the  modified,  regression  smooth  by 


(A. 5  ) 


S(<t>|x)  ■  $  + 


rvU,x)  x-x 
-hr*! 


This  modified  smooth  is  bounded  by  2. 


(A.6  )  THEOREM.  If,  as  N  — *•»,  M— *■<»,  M  log  N/N— ►O  and  P(dx)  has  no 
atoms,  then  the  modified  regression  smooths  are  m.s.  consistent. 


PROOF.  Assume  lloll^  <  °°  and  use  the  inequality  (A. 3  )  with  g(x)  = 
Px(d>  |  x) .  Then 


s(*-9|x)  =  ?JgI(x)U(yj)-g(Xj)»l  • 

X  X 
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The  conditional  expectation  of  [S(<j>-g(x)]2  given  {x. }  is 


(X,-Xv)  X-J 


•j  "XV”  _Xti2 

rr  L  rr  J J 


Thus,  the  first  term  in  (A. 3  )  is  asymptotically  zero.  Now  look  at 
S(h|x)  -h(x),  h  e  HgU),  h  continuous  and  zero  outside  a  finite 
interval ; 

s(h|x)  -  h(x)  -jj,  IjSJ(x)<h(xj>-h<x))f’ 

Then,  for  H(6)  =  max{ |h(x* )-h(x") | ;  |x'-x"|  <<$}, 

[S(h | x)-h(x) ]  <_  j (^(Xj )-h(x) )2] *2 

<  2H(  max  |x.-x|)  . 
jSJ(x)  J 

Then  to  get  E[S(h|x)-h(x)]2  — ►(),  it  is  enough  to  show  that  A 
An  *  max{ |xj-x| ;  Xjej(x)}  converges  in  probability  to  zero.  Take  x 
to  be  a  point  such  that  P[(x,x+e)]  >  0,  P[(x-e,x)]  >  0  for  all  e  >  0. 
The  set  S  of  all  such  points  has  P(S)  *  1.  Then 


{AN>e}  C  {at  most  2M-1  of  (xk>  in  (x-e,x)} 

u  (at  most  2M-1  of  {x^}  in  (x,x+e)}  . 

Using  the  Binomial  distribution  gives  the  bound 

P(AN>e)  <  2NM{(1  -P[{x,x+e)])N-M+(l  -P[(x-e,x)])N_M} 


Holding  c  fixed  with  N  — *■«  and  M  =  o(N/log  N)  results  in 
P(A^>e)— *0,  proving  the  theorem. 
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TABLE  1 

Variables  Used  in  the  Housing  Value  Equation 
of  Harrison  and  Rubinfeld  (1978) 

Variable  Definition 

MV  Median  value  of  owner-occupied  homes 

RM  Average  number  of  rooms  in  owner  units 

AGE  Proportion  of  owner  units  built  prior  to  1940 

DIS  Weighted  distances  to  five  employment  centers  in  the  Boston 

region 

RAD  Index  of  accessibility  to  radial  highways 

TAX  Full  property  tax  rate  ($/$l 0,000) 

PTRATIO  Pupil -teacher  ratio  by  town  school  district 
B  Black  proportion  of  population 

LSAT  Proportion  of  population  that  is  lower  status 
CRIM  Crime  rate  by  town 

ZN  Proportion  of  town's  residential  land  zoned  for  lots  greater 

than  25,000  square  feet 

INDUS  Proportion  of  nonretail  business  acres  per  town 

CHAS  Charles  River  dummy:  =  1  if  tract  bounds  the  Charles  River; 

=  0  if  otherwise 

NOX  Nitrogen  oxide  concentration  in  pphm 
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Appendix  2 

FORTRAN  Implementation  of  ACE  Algorithm 

This  section  presents  a  listing  of  a  FORTRAN  program  implementing  the 
ACE  algorithm.  It  consists  of  six  subroutines.  The  first  two  are  called 
by  the  user  to  perform  functions  (ACE .ACEMOD) ,  the  next  is  a  BLOCK  DATA 
subprogram  in  which  the  values  of  several  internal  parameters  are  initialized, 
and  the  last  three  provide  utilities  used  by  the  first  two  subroutines.  The 
user  interface  is  through  FORTRAN  subroutine  calls.  The  data  and  various 
input  parameters  are  supplied  as  arguments  in  the  calling  sequence.  The 
optimal  transformation  estimates  and  other  output  quantities,  as  well  as 
required  scratch  storage  are  also  passed  as  subroutine  arguments.  In  order 
to  employ  these  routines  it  is  necessary  to  write  a  main  or  calling  program 
that  reads  the  input  data  into  appropriate  arrays,  to  declare  additional 
arrays  for  output  and  scratch  storage,  and  then  to  execute  a  call  to  the 
appropriate  subroutine  (ACE  or  ACEMOD). 

SUBROUTINE  ACE  computes  the  optimal  transformation  estimates  using  the 
double  loop  restart  version  of  the  ACE  algorithm  (see  Sections  2  and  5.3). 
These  estimates  are  stored  as  a  set  of  transformed  values,  one  value  for 
each  observation,  for  the  response  and  each  predictor  variable.  Upon  return 
from  SUBROUTINE  ACE  these  values  are  stored  in  the  arrays  specified  in  the 
subroutine  call.  This  information  can  then  be  passed  to  appropriate  graphics 
routines  for  display,  or  to  function  approximation  routines  for  summarization. 
SUBROUTINE  ACEMOD  can  (optionally)  be  called  to  estimate  new  response  values 
given  a  set  of  predictor  values  and  the  optimal  transformation  estimates  from 
SUBROUTINE  ACE. 
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As  written,  this  program  is  intended  to  be  used  in  conjunction  with  a 
particular  smoothing  subroutine  ("super-smoother")  which  is  listed  in 
Friedman  and  Stuetzle  (1982).  Our  experience  so  far  with  the  ACE  procedure 
has  been  gained  in  this  context.  It  is  possible  to  employ  other  smoothing 
routines  by  properly  interfacing  them  with  the  ACE  code.  ACE  calls  for  the 
smooth  by  the  FORTRAN  statement 

CALL  SUPSMU  ( N, X, Y, W,L, ALPHA, RESPAN, IBIN.SMO, SCR) 
where  the  parameters  have  the  following  meaning: 

N:  number  of  observations  (X,Y)  pairs. 

X(1...N):  ordered  abscissa  values. 

Y(1...N):  corresponding  ordinate  values. 

W(1...N):  weight  for  each  observation. 

L:  abscissa  variable  flag  -  L  =  1  ordered  variable. 

L  *  2  periodic  (circular)  variable. 

ALPHA, RESPAN, IBIN:  miscellaneous  parameters  (can  be  set  through 

COMMON/PARMS/). 

SM0(1...N):  output  smoothed  ordinate  values. 

SCR ( 1 . . .N,l. . .3) :  scratch  array. 


SUBROUTINE  ACE  (P,N,X,Y,W,L, DELRSQ, TX, TY, RSQ, I ERR, M , Z ) 


OPTIMAL  TRANSFORMATIONS  FOR  CORRELATION  AND  MULTIPLE  REGRESSION 
BY  ALTERNATING  CONDITIONAL  ESTIMATES. 


( BREIMAN  AND  FRIEDMAN,  1982) 


CODED  BY: 


J.  H.  FRIEDMAN 
DEPARTMENT  OF  STATISTICS  AND 
STANFORD  LINEAR  ACCELERATOR  CENTER 
STANFORD  UNIVERSITY 
STANFORD,  CA.  94305 


INPUT : 


C  N  :  NUMBER  OF  OBSERVATIONS. 

C  P  :  NUMBER  OF  PREDICTOR  VARIABLES  FOR  EACH  OBSERVATION. 

C  X ( P , N )  :  PREDICTOR  DATA  MATRIX. 

C  Y(N)  :  RESPONSE  VALUE  FOR  EACH  OBSERVATION. 

C  W(N)  :  WEIGHT  FOR  EACH  OBSERVATION. 

C  L ( P+1 )  :  FLAG  FOR  EACH  VARIABLE. 

C  L ( 1 )  THROUGH  L(P)  :  PREDICTOR  VARIABLES. 

C  L ( P+1 )  :  RESPONSE  VARIABLE. 

C  L ( I )=0  =  >  ITH  VARIABLE  NOT  TO  BE  USED. 

C  L ( I ) =1  =>  ITH  VARIABLE  ASSUMES  ORDERED  VALUES. 

C  L ( I ) =2  =>  ITH  VARIABLE  ASSUMES  CIRCULAR  (PERIODIC)  VALUES 

C  L(I )=3  =>  ITH  VARIABLE  TRANSFORMATION  is  TO  BE  MONOTONE. 

C  L(I)=4  =>  ITH  VARIABLE  TRANSFORMATION  IS  TO  BE  LINEAR. 

C  L(I)«5  ■>  ITH  VARIABLE  ASSUMES  CATEGORICAL  VALUES. 

C  DELRSQ  :  TERMINATION  THRESHOLD.  ITERATION  STOPS  WHEN 
C  RSQ  CHANGES  LESS  THAN  DELRSQ  IN  NTERM 

C  CONSECUTIVE  ITERATIONS  (SEE  BELOW  -  DEFAULT,  NTERM=3 ) . 


C  OUTPUT: 

C 

C  TX ( N , P )  :  PREDICTOR  TRANSFORMATIONS. 

C  TX ( J , I )  =  TRANSFORMED  VALUE  OF  ITH  PREDICTOR  FOR  JTH  OBS . 

C  T Y ( N )  =  RESPONSE  TRANSFORMATION. 

C  TY ( J )  =  TRANSFORMED  RESPONSE  VALUE  FOR  JTH  OBSERVATION. 

C  RSQ  =  FRACTION  OF  VARIANCE (TY<Y> ) 

C  P 

C  EXPLAINED  BY  SUM  TX(I)<X(I)>  . 

C  1  =  1 

C  I ERR  :  ERROR  FLAG. 

C  IERR  =  0  :  NO  ERRORS  DETECTED. 

C  IERR  >  0  :  ERROR  DETECTED  -  SEE  FORMAT  STATEMENTS  BELOW. 

C 

C  SCRATCH: 

C 

C  M( N , P+1 ) ,  Z ( N , 7 )  :  INTERNAL  WORKING  STORAGE. 

C  (Z(J,1),  J=1,N)  CONTAIN  (TRANSFORMED)  RESIDUALS  AS  OUTPUT. 

C 

C  NOTE:  THIS  ROUTINE  USES  AS  A  PRIMITIVE  THE  ‘SUPER  SMOOTHER’ 

C  SUPSMU  (SEE  -  FRIEDMAN  AND  STUETZLE  (1982).  SMOOTHING  OF 
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C  SCATTERPLOTS .  STANFORD  UNIVERSITY  STATISTICS  DEPARTMENT 

C  REPORT  ORION006 . ) 

C 

C - - - 

INTEGER  P,PP1,M(N,1),L(1) 

REAL  Y(N) ,X(P,N) ,  W  ( N ) , T Y ( N ) , TX (N,P),Z(N,7), CT (10) 

COMMON  /PARMS/  ITAPE, MAXIT, NTERM, ALPHA, RESPAN, IBIN 

DOUBLE  PRECISION  SM,SV,SW 

IERR=0 

PP1=P+1 

SM=0 . 0 

S  V=SM 

SW=SV 

IF  ( L ( PP1 ) . GT . 0 )  GO  TO  10 
IERR=4 

IF  (ITAPE. GT.O)  WRITE  (ITAPE, 360)  PP1 
RETURN 
10  NP=0 

DO  20  1=1, P 
IF  (L(I).GT.O)  NP=NP+1 
20  CONTINUE 

IF  (NP.GT.O)  GO  TO  30 
IERR=5 

IF  (ITAPE. GT.O)  WRITE  (ITAPE, 370)  P 
RETURN 

30  DO  40  J=1,N 

SM=SM+W(J)*Y(J) 

SV=SV+W(J)*Y(J)**2 

SW=SW+W(J) 

M( J , PP1 )=J 
Z  ( J ,  2  )  =Y  ( J  ) 

40  CONTINUE 

IF  (SW.GT.0.0)  GO  TO  50 
IERR=1 

IF  (ITAPE. GT.O)  WRITE  (ITAPE, 330) 

RETURN 

50  SM=SM/SW 

SV=SV/SW-SM**2 
IF  (SV.LE.0.0)  GO  TO  60 
SV=1 . 0/DSQRT (SV ) 

GO  TO  70 
60  IERR=2 

IF  (ITAPE. GT.O)  WRITE  (ITAPE, 340) 

RETURN 

70  DO  80  J=1,N 

Z(J,1)=(Y(J) -SM) *SV 
80  CONTINUE 

CALL  SORT  (Z(l, 2) ,M(1,PP1) , 1,N) 

DO  100  1=1, P 

IF  (L(I).LE.O)  GO  TO  100 

DO  90  J=1,N 

TX( J , I )=0 . 0 

M( J, I ) = J 

Z ( J , 2 ) =X ( I , J ) 

90  CONTINUE 


4 
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CALL  SORT  (Z(1,2),M(1,I),1,N) 

100  CONTINUE 
RSQ=0.0 
ITER=0 

NTERM=MIN0{NTERM, 10) 

NT=0 

DO  110  1=1 , NTERM 
CT ( I ) =100 . 0 
110  CONTINUE 
120  ITER=ITER+1 
DO  130  J=1 / N 
TY { J  )  =Z  ( J,  1 ) 

130  CONTINUE 
NIT=0 

140  RSQI=RSQ 
NIT=NIT+1 
DO  200  1=1 ,P 
IF  (L( I ) . LE . 0 )  GO  TO  200 
DO  150  J=1 , N 
K=M ( J , I ) 

Z ( J / 1 ) =TY ( K ) +TX ( K ,  I ) 

Z  ( J ,  2  ) =X ( I , K ) 

Z ( J , 7 ) =W ( K ) 

150  CONTINUE 

CALL  SMOTHR  (L ( I ) , N, Z( 1 , 2 ) , Z , Z ( 1 , 7 ) , Z ( 1 , 6 ) , Z ( 1 , 3 ) ) 

SM=0 . 0 

DO  160  J=1 , N 
SM=SM+Z (J,7)*Z(J,6) 

160  CONTINUE 
SM=SM/SW 
DO  170  J=1 »  N 
Z(J,6)=Z(J,6) -SM 
170  CONTINUE 
SV=0.0 

DO  180  J-l.N 

SV=SV+Z ( J , 7 ) * ( Z ( J , 1)-Z(J,6) )  **2 
180  CONTINUE 

SV=1.0-SV/SW 

IF  (SV.LE.RSQ)  GO  TO  200 

RSQ=SV 

DO  190  J=1,N 
K=M( J, I) 

TX ( K,  I ) =Z ( J, 6 ) 

TY (K)=Z(J,1)-Z(J,6) 

190  CONTINUE 
200  CONTINUE 

IF  ( (NP.NE. 1) .AND. ( { RSQ-RSQI . GT . DELRSQ) .AND. ( NIT . LT . MAXIT ) ) )  GOTC 
1  140 

IF  ( RSQ . NE . 0 . 0 • OR. ITER* NE . 1 )  GO  TO  230 
DO  220  1=1, P 

IF  ( L ( I ) .LE.O)  GO  TO  220 
DO  210  J=1 , N 
TX( J , I )=X( I , J) 

210  CONTINUE 
220  CONTINUE 
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230  DO  250  J=1,N 
K=M(  J , PP1 ) 

Z  ( J ,  2  )  =Y  { K ) 

Z(J, 7)=W(K) 

Z(J<1)=0.0 
DO  240  1=1,  P 

IF  (L(I).GT.O)  Z(J, 1)=Z(J, 1) +TX (K, I ) 

240  CONTINUE 
250  CONTINUE 

CALL  SMOTHR  (L ( PP1 ) , N, Z ( 1 , 2 ) , Z , Z ( 1 , 7 ) , Z ( 1 , 6  )  , Z ( 1 , 3  )  ) 

SM=0.0 

SV=SM 

DO  260  J=1 , N 
K=M( J, PP1 ) 

SM=SM+W(K)*Z(J, 6) 

S V=SV+W ( K ) *  Z ( J , 6) *  *  2 
Z ( K, 2 )=Z ( J , 1 ) 

260  CONTINUE 
SM=SM/SW 
SV=SV/SW-SM**2 
IF  (SV.LE.0.0)  GO  TO  270 
SV=1 . 0/ DSQRT ( SV ) 

GO  TO  280 
270  IERR=3 

IF  (ITAPE.GT.O)  WRITE  (ITAPE,350) 

RETURN 

280  DO  290  J =1 , N 
K=M ( J ,  PP1 ) 

TY (K )= (Z ( J , 6 ) -SM) *SV 
290  CONTINUE 
SV=0 . 0 

DO  300  J=1,N 
Z  ( J ,  1 )=TY( J ) -Z ( J , 2  ) 

SV=SV+W(J)*Z( J, 1)**2 
300  CONTINUE 

RSQ=1.0-SV/SW 

IF  (ITAPE.GT.O)  WRITE  (ITAPE,320)  ITER, RSQ 

NT=MOD ( NT , NTERM ) + 1 

CT ( NT )=RSQ 

CMN=100 . 0 

CMX=-100 . 0 

DO  310  1=1, NTERM 

CMN=AMIN 1 ( CMN , CT ( I ) ) 

CMX=AMAX1 ( CMX, CT ( I ) ) 

310  CONTINUE 

IF  ( (CMX-CMN.GT.DELRSQ) .AND. (ITER. LT.MAXIT) )  GOTO  120 


320 

RETURN 
FORMAT ( 

11H 

ITERATION  12, 

23H  R**2  = 

1  -  E**2  =G 1 2 . 4 ) 

330 

FORMAT  ( 

41H 

IERR=1 : 

SUM  OF 

WEIGHTS  (W)  NOT 

POSITIVE. ) 

340 

FORMAT ( 

29H 

IERR=2 : 

Y  HAS 

ZERO  VARIANCE.) 

350 

FORMAT ( 

30H 

IERR=3: 

TY  HAS 

ZERO  VARIANCE.) 

360 

FORMAT ( 

11H 

IERR=4: 

L(I2, 

19H )  MUST  BE  POSITIVE.) 

370 

FORMAT ( 

29H 

IERR=5 : 

AT  LEAST  ONE  L(1)-L(I2, 

19H )  MUST  BE  POST. 

II VE .  ) 
END 
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SUBROUTINE  ACEMOD  ( V, P,N,X, Y, L,TX,TY,M, YHAT, IERR) 


COMPUTES  RESPONSE  ESTIMATES  FROM  THE  MODEL 

-1  P 

YHAT  =  TY  (  SUM  TX(I)<V(I)>  ) 

1  =  1 

USING  THE  TRANSFORMATIONS  CONSTRUCTED  BY  SUBROUTINE  ACE. 

INPUT : 

V(P)  :  VECTOR  OF  PREDICTOR  VALUES. 

P,N,X,Y,L  :  SAME  INPUT  AS  FOR  SUBROUTINE  ACE. 

TX, TY, M  :  OUTPUT  FROM  SUBROUTINE  ACE. 

OUTPUT: 

YHAT  :  ESTIMATED  RESPONSE  VALUE  FOR  V. 

IERR  :  ERROR  FLAG. 

IERR=0:  NO  ERROR  DETECTED. 

IERR=1:  ERROR  DETECTED  -  SEE  FORMAT  STATEMENT  BELOW. 

NOTE:  THIS  ROUTINE  REQUIRES  THAT  THE  RESPONSE  TRANSFORMATION  TY  IS  A 
STRICTLY  (INCREASING  OR  DECREASING)  MONOTONE  FUNCTION  OF  Y,  THAT 
IS  L( P+1 )  =  3  OR  4  IN  THE  CALL  TO  SUBROUTINE  ACE. 


INTEGER  P, PP1 , M (N, 1) ,  L  ( 1 ) , LOW, HIGH , PLACE 
REAL  V(P),X(P,N),Y(N), TY ( N ) , TX ( N , P ) 

COMMON  /PARMS/  ITAPE, MAXIT,NTERM, ALPHA, RESPAN, IBIN 

PP1=P+1 

IERR=0 

IF  (L(PP1 ) . EQ. 3 . OR. L( PP1 ) . EQ. 4)  GO  TO  10 
IERR=1 

IF  ( ITAPE. GT.O)  WRITE  (ITAPE, 140)  PP1 

RETURN 

YH=0 . 0 

DO  80  1=1,  P 

IF  ( L ( I ) . LE . 0 )  GO  TO  80 
VI=V(I) 

IF  (VI.GT.X(I,M(1, I) ) )  GO  TO  20 

PLACE=1 

GO  TO  70 

IF  (VI.LT.X(I,M(N, I) ) )  GO  TO  30 

PLACE=N 

GO  TO  70 

LOW=0 

HIGH=N+1 

IF  ( LOW+1 . GE . HIGH )  GO  TO  60 
PLACE* ( LOW+HIGH ) / 2 
XT=X ( I , M ( PLACE , I ) ) 

IF  (VI.EQ.XT)  GO  TO  70 
IF  (VI.GE.XT)  GO  TO  50 
HIGH=PLACE 
GO  TO  40 
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50  L0W= PLACE 

GO  TO  40 
60  JL=*M  ( LOW ,  I ) 

JH=M( HIGH, I ) 

YH=*YH+TX{  JL,  I )  +  {TX(  JH,  I  )-TX(  JL,  I )  )  *  ( VI-X  ( I ,  JL )  )  /  (X  ( I ,  JH ) -X  ( I ,  JL)  ) 
GO  TO  80 

70  YH=YH+TX(M( PLACE, I) , I) 

80  CONTINUE 

IF  ( YH . GT • TY ( M ( 1 , PP 1 ) ) )  GO  TO  90 
YHAT=Y ( M ( 1 , PP1 ) ) 

RETURN 

90  IF  (YH.LT.TY(M(N, PP1 ) ) )  GO  TO  100 
YHAT=Y(M(N,PP1) ) 

RETURN 
100  LOW=0 

HIGH=N+1 

XT— TY(M(N, PP1 ) ) -TY(M{ 1 ,  PP1 ) ) 

XT=XT/ABS(XT) 

110  IF  ( LOW+1 . GE . HIGH )  GO  TO  130 
PLACE= ( LOW+HIGH ) / 2 

IF  (XT*YH.GE.XT*TY(M{ PLACE, PP1) ) )  GO  TO  120 
HIGH=PLACE 
GO  TO  110 
120  LOW=PLACE 
GO  TO  110 
130  JL=M ( LOW, PP1 ) 

JH=M(HIGH, PP1 ) 

YHAT=>Y(JL)  +  (Y(JH)-Y(JL) ) * ( YH-TY( JL ) ) / ( TY( JH ) -TY ( JL ) ) 

RETURN 

140  FORMAT (  11H  IERR-1 :  L(I2,  55H)  MUST  EQUAL  3  OR  4  -  MONOTONE 

1 RESPONSE  TRANSFORMATION.) 

END 

BLOCK  DATA 

COMMON  /PARMS/  ITAPE, MAXIT, NTERM, ALPHA, RESPAN, IB IN 


THESE  PROCEDURE  PARAMETERS  CAN  BE  CHANGED  IN  THE  CALLING  ROUTINE 
BY  DEFINING  THE  ABOVE  LABELED  COMMON  AND  RESETTING  THE  VALUES  WITH 
EXECUTABLE  STATEMENTS. 

ITAPE  :  FORTRAN  FILE  NUMBER  FOR  PRINTER  OUTPUT. 

( ITAPE. LE.O  =>  NO  PRINTER  OUTPUT.) 

MAXIT  :  MAXIMUM  NUMBER  OF  ITERATIONS. 

NTERM  ;  NUMBER  OF  CONSECUTIVE  ITERATIONS  FOR  WHICH  ESTIMATED 

CORRELATION  MUST  CHANGE  LESS  THAN  DELCOR  FOR  CONVERGENCE. 
ALPHA,  RESPAN,  IBIN  :  SUPER  SMOOTHER  PARAMETERS. 

(SEE  -  FRIEDMAN  AND  STUETZLE,  REFERENCE  ABOVE.) 


DATA  ITAPE, MAXIT, NTERM, ALPHA, RESPAN, IBIN  /6 , 20, 3 , 0 . 1 , 0 . 25, 1/ 
END 

SUBROUTINE  SMOTHR  (L, N, X, Y, W, SMO, SCR) 

REAL  X(N),Y(N),W(N), SMO ( N ) , SCR ( N, 3 ) 

COMMON  /PARMS/  ITAPE, MAXIT, NTERM, ALPHA, RESPAN, IBIN 
DOUBLE  PRECISION  SM,SW 


IF  (L.LT.5)  GO  TO  50 
J=1 

10  J0=J 

SM=W( J)*Y(J) 

SW=W(J) 

IF  (J.GE.N)  GO  TO  30 
20  IF  (X(J+1) .GT.X( J) )  GO  TO  30 
J-J+l 

SM=SM+W( J)*Y(J) 

SW=SW+W( J) 

IF  (J.LT.N)  GO  TO  20 
30  SM=SM/SW 

DO  40  I=J0,J 
SMO ( I ) =SM 
40  CONTINUE 
J=J+1 

IF  (J.LE.N)  GO  TO  10 
GO  TO  240 

50  IF  (L.NE.4)  GO  TO  80 
SM— 0 . 0 
SW=SM 

DO  60  J=1 ,  N 
SM=SM+W( J ) *X ( J ) *Y( J ) 

SW=SW+W ( J ) *X ( J ) *  *  2 
60  CONTINUE 
A=SM/SW 
DO  70  J=1 , N 
SMO(J)=A*X(J) 

70  CONTINUE 
GO  TO  240 

80  CALL  SUPSMU  (N, X, Y, W, L, ALPHA, RESPAN, IBIN, SMO, SCR) 
IF  (L.NE.3)  GO  TO  240 
DO  90  J=1,N 
SCR(J,l)*SMO(J) 

SCR(N-J+1, 2)=SCR(J, 1) 

90  CONTINUE 

CALL  MONTNE  (SCR,N) 

CALL  MONTNE  ( SCR ( 1 , 2 ) , N ) 

SM=0 . 0 
SW=SM 

DO  100  J— 1 ,  N 

SM»SM+(SMO(J)-SCR(J, 1) )**2 
SW=SW+(SMO( J)-SCR(N-J+1, 2 ) )**2 
100  CONTINUE 

IF  (SM.GE.SW)  GO  TO  120 
DO  110  J=l, N 
SMO( J )=SCR( J, 1) 

110  CONTINUE 
GO  TO  140 
120  DO  130  J=1,N 

SMO(J)=SCR(N-J+l, 2) 

130  CONTINUE 
140  J-l 
150  J0=J 

IF  (J.GE.N)  GO  TO  170 
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160  IP  (SMO(J+l) .NE.SMO(J) )  GO  TO  170 
J=\J+1 

IF  (J.LT.N)  GO  TO  160 
170  IF  (J.LE.JO)  GO  TO  190 
A-0 . 0 

IF  (J0.GT.1)  A=0 . 5* (SMO (JO )-SMO( JO-1 ) ) 

B=0 . 0 

IF  (J.LT.N)  B-0.5*(SMO(J+1)-SMO(J) ) 

D=( A+B) / (J-JO ) 

IF  (A. EQ .0.0. OR. B . EQ .0.0)  D=2.0*D 

IF  ( A. EQ .0.0)  A=B 

DO  180  I=J0,J 

SMO ( I ) -SMO ( I ) -A+D* ( I- JO ) 

180  CONTINUE 
190  J=J+1 

IF  (J.LE.N)  GO  TO  150 
J*1 

200  J0=J 

SM*SMO(J) 

IF  (J.GE.N)  GO  TO  220 
210  IF  (X(J+1) .GT.X(J) )  GO  TO  220 
J=J+1 

SM=SM+SMO(J) 

IF  (J.LT.N)  GO  TO  210 
220  SM=SM/ ( J-JO+1 ) 

DO  230  I=J0,J 
SMO ( I ) aSM 
230  CONTINUE 
J*J+1 

IF  (J.LE.N)  GO  TO  200 
240  RETURN 
END 

SUBROUTINE  MONTNE  (X,N) 

REAL  X (N ) 

INTEGER  BB,EB,BR,ER, BL, EL 

BB=0 

EB=BB 

10  IF  (EB.GE.N)  GO  TO  110 
BB=EB+1 
EB=BB 

20  IF  (EB.GE.N)  GO  TO  30 

IF  (X( BB) . NE . X ( EB+1 ) )  GO  TO  30 

EB=EB+1 

GO  TO  20 

30  IF  (EB.GE.N)  GO  TO  70 

IF  (X(EB) .LE.X(EB+1) )  GO  TO  70 

BR=EB+1 

ER=BR 

40  IF  (ER.GE.N)  GO  TO  50 

IF  (X(ER+1) .NE.X(BR) )  GO  TO  50 
ER*ER+1 
GO  TO  40 

PMN= (X ( BB ) * ( EB-BB+1 ) +X ( BR ) * ( ER-BR+1 ) )/(ER-BB+l) 
EB*ER 

DO  60  I-BB , EB 


50 


-76- 


X ( I ) =PMN 
60  CONTINUE 

70  IF  (BB.LE.l)  GO  TO  10 

IF  (X(BB-1 ) .LE.X(BB) )  GO  TO  10 

BL=BB-1 

EL— BL 

80  IF  (BL.LE.l)  GO  TO  90 

IF  (X(BL-l) .NE.X(EL) )  GO  TO  90 

BL=BL-1 

GO  TO  80 

90  PMN= (X ( BB ) * ( EB-BB+1 ) +X { BL) * ( EL-BL+1 ) )/(EB-BL+l ) 

BB=BL 

DO  100  I=BB,EB 
X ( I ) =PMN 
100  CONTINUE 
GO  TO  30 
110  RETURN 
END 
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SUBROUTINE  SORT  (V,A,II,JJ) 


PUTS  INTO  A  THE  PERMUTATION  VECTOR  WHICH  SORTS  V  INTO 
INCREASING  ORDER.  ONLY  ELEMENTS  FROM  II  TO  JJ  ARE  CONSIDERED. 
ARRAYS  IU ( K )  AND  IL(K)  PERMIT  SORTING  UP  TO  2**(K+1)-1  ELEMENTS 


THIS  IS  A  MODIFICATION  OF  CACM  ALGORITHM  #347  BY  R.  C. 
WHICH  IS  A  MODIFIED  HOARE  QUICKSORT. 


SINGLETON, 


DIMENSION  A(JJ),V(1),IU(20),IL(20) 

INTEGER  T , TT 

INTEGER  A 

REAL  V 

M=1 

I=II 

JaJJ 

IF  (I.GE.J)  GO  TO  80 

K=I 

IJ=s(  J+I ) / 2 
T=>A(  IJ ) 

VT~V(IJ) 

IF  (V(I) .LE.VT)  GO  TO  30 
A(IJ)=A(I) 

A(I)=T 

T=A(IJ) 

V(IJ)=V(I) 

V ( I ) =VT 
VT=V ( I J ) 

L*J 

IF  (V(J).GE.VT)  GO  TO  50 
A(IJ)=A(J) 

A(  J )  =>T 
T»A ( I J ) 

V ( IJ )=V( J ) 

V(J)=VT 

VT=V(IJ) 

IF  (V( I). LE.VT)  GO  TO  50 
A( IJ )=A( I ) 

A(I)=T 
T=A( IJ) 

V(IJ)*V(I) 

V ( I ) =VT 
VT=V ( I J ) 

GO  TO  50 
A(L)*A(K) 

A ( K ) =TT 
V(L)«V(K) 

V ( K ) «VTT 
L*L-1 

IF  (V(L).GT.VT)  GO  TO  50 
TT«A(L) 

VTT-V(r.) 

K-K+l 

IF  (V{K) .LT. VT)  GO  TO  60 
IF  (K.LE.L)  GO  TO  40 


IF  (L-I.LE.J-K)  GO  TO  70 

IL(M)=I 

IU(M)=L 

I=K 

M=M+1 

GO  TO  90 

IL(M)=K 

IU(M)=J 

J=L 

M=M+1 

GO  TO  90 

M=M-1 

IF  (M.EQ.O)  RETURN 
I=IL(M) 

J=IU(M) 

IF  (J-I.GT.10)  GO  TO  20 
IF  (I.EQ.II)  GO  TO  10 
1=1-1 
1=1  +  1 

IF  (I.EQ.J)  GO  TO  80 
T=A( 1+1 ) 

VT=V ( 1+1 ) 

IF  (V(I).LE.VT)  GO  TO  100 
K=I 

A(K+1)=A(K) 

V ( K+ 1 ) =V ( K ) 

K=K-1 

IF  (VT.LT.V(K))  GO  TO  110 
A ( K+ 1 ) =T 
V (K+l )=VT 
GO  TO  100 


